Defining new POSIX-like character class names in C/C++

I'm currently working on a C++ project that expresses the HTTP RFC rules as regexes. RFC 1945, Chapter 2.2 - Basic Rules, defines the following rules:
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
CRLF = CR LF
LWS = [CRLF] 1*( SP | HT )
word = token | quoted-string
token = 1*<any CHAR except CTLs or tspecials>
tspecials = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
quoted-string = ( <"> *(qdtext) <"> )
qdtext = <any CHAR except <"> and CTLs, but including LWS>
What I'm interested in is the usage of character classes like [:digit:] or at least recycling the regex. The pseudo-code would become something like this (\ is already escaped and regex is already in string form):
CHAR = "\\x00-\\x7F"
CTL = "\\x00-\\x1F\\x7F"
CR = "\\r"
LF = "\\n"
SP = "\\x20"
HT = "\\t"
//here I start "recycling" old regexes
CRLF = "[:CR:][:LF:]"
LWS = "[:CRLF:]* ( [:SP:] | [:HT:] )+"
//here a declaration might happen before using the token or quoted_string class
word = "[:token:] | [:quoted_string:]"
token = "( (?= [[:CHAR:]] ) [^[:tspecials:][:CTL:]] )+"
tspecials = "()<>@,;:\\\\\\"/\\[\\]?={}[:SP:][:HT:]"
quoted_string = " ( \\" ([:qdtext:])* \\" ) "
//Little trick to allow LWS but not CTLs: https://stackoverflow.com/a/18017758/9373031
qdtext = "(?=[[:CHAR:]]) ( [:LWS:] | [^\\"[:CTL:]] )"
What I tried so far is to store them as strings and chain them together with +, but that looked ugly and not very optimized. Of course I could repeat some regexes, but the result became an enormous monster the further I went.
I googled for a while, but found nothing about adding custom POSIX-like classes, nor anything about recycling (and optimizing?) regexes.
What I need is a way to optimize and prettify the strings the regexes originate from, so that they can be parsed into new ones as POSIX-like classes or in some other way (code in C/C++):
std::regex CR ("\\r");
std::regex LF ("\\n");
std::regex CRLF ("[:CR:] [:LF:]");
Option 1:
[:CR:] [:LF:] would be expanded to \\r \\n and at compilation would become: std::regex CRLF ("\\r \\n");
Option 2:
[:CR:] [:LF:] would be "expanded" as "two functions" to optimize regex at run-time.
So far I found std::ctype_base has the static methods used for classnames in the std::regex_traits<CharT>::lookup_classname function, that should be used for finding defined classnames: is it possible to extend the masks used?

You need some kind of metalanguage and a compiler for it. This is not a task for the C++ preprocessor alone, nor for the compiler's constant folding or other compile-time features.
With the metalanguage you describe your variant of extended RE. Your compiler then parses that and generates some input for the main project: either just a set of strings to be used as input for the conventional RE engine, or something smarter and more complex.
Tools for your task do exist: http://www.nongnu.org/bnf/, flex/bison, etc. They let you produce not just a set of RE strings but a whole parser for your metalanguage (you have asked for optimization), if such a concept is allowed in your project.
Or you can write your own parser from scratch.
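As a rough illustration of the "set of strings" route, the sketch below expands [:NAME:] placeholders by plain string substitution before the pattern ever reaches std::regex. The helper name expand_classes and the map of definitions are assumptions made for this example and are not part of any library; names missing from the map (such as the built-in [:digit:]) are left untouched.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: replaces every [:NAME:] placeholder whose NAME is a key
// in `defs` with the stored regex fragment. Unknown names are copied verbatim,
// so built-in POSIX classes like [:digit:] still reach std::regex unchanged.
std::string expand_classes(const std::string& pattern,
                           const std::map<std::string, std::string>& defs) {
    static const std::regex placeholder(R"re(\[:([A-Za-z_]+):\])re");
    std::string out;
    std::size_t last = 0;  // end of the previously copied region
    for (std::sregex_iterator it(pattern.begin(), pattern.end(), placeholder), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position());
        out.append(pattern, last, pos - last);  // text before the placeholder
        const auto found = defs.find(m[1].str());
        out += (found != defs.end()) ? found->second : m.str();
        last = pos + static_cast<std::size_t>(m.length());
    }
    out.append(pattern, last, std::string::npos);  // trailing text
    return out;
}
```

Storing the result of expand_classes("[:CR:][:LF:]", defs) back into the map under "CRLF" lets later rules such as LWS reuse it. The expansion happens once at run time, before each std::regex is constructed, so it corresponds to Option 1 rather than Option 2.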

Related

perl6 regex: match all punctuations except . and "

I read some threads on matching "X except Y", but none specific to perl6. I am trying to match and replace all punctuation except . and "
> my $a = ';# -+$12,678,93.45 "foo" *&';
;# -+$12,678,93.45 "foo" *&
> my $b = $a.subst(/<punct - [\.\"]>/, " ", :g);
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
------> my $b = $a.subst(/<punct⏏ - [\.\"]>/, " ", :g);
Unrecognized regex metacharacter (must be quoted to match literally)
------> my $b = $a.subst(/<punct -⏏ [\.\"]>/, " ", :g);
Unable to parse expression in metachar:sym<assert>; couldn't find final '>' (corresponding starter was at line 1)
------> my $b = $a.subst(/<punct - ⏏[\.\"]>/, " ", :g);
> my $b = $a.subst(/<punct-[\.\"]>/, " ", :g);
===SORRY!=== Error while compiling:
Unable to parse expression in metachar:sym<assert>; couldn't find final '>' (corresponding starter was at line 1)
------> my $b = $a.subst(/<punct⏏-[\.\"]>/, " ", :g);
expecting any of:
argument list
term
> my $b = $a.subst(/<punct>-<[\.\"]>/, " ", :g);
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
------> my $b = $a.subst(/<punct>⏏-<[\.\"]>/, " ", :g);
Unable to parse regex; couldn't find final '/'
------> my $b = $a.subst(/<punct>-⏏<[\.\"]>/, " ", :g);
> my $b = $a.subst(/<- [\.\"] + punct>/, " ", :g); # $b is blank space, not what I want
> my $b = $a.subst(/<[\W] - [\.\"]>/, " ", :g);
12 678 93.45 "foo"
# this works, but clumsy; I want to
# elegantly say: punctuations except \. and \"
# using predefined class <punct>;
What is the best approach?
I think the most natural solution is to use a "character class arithmetic expression". This entails using + and - prefixes on any number of either Unicode properties or [...] character classes:
#;# -+$12,678,93.45 "foo" *&
<+:punct -[."]> # +$12 678 93.45 "foo"
This can be read as "the class of characters that have the Unicode property punct minus the . and " characters".
Your input string includes + and $. These are not considered "punctuation" characters. You could explicitly add them to the set of characters being replaced by spaces:
<:punct +[+$] -[."] > # 12 678 93.45 "foo"
(I've dropped the initial + before :punct. If you don't write a + or - for the first item in a character class arithmetic expression then + is assumed.)
There's a Unicode property that covers all "symbols" including + and $ so you could use that instead:
<:punct +:symbol -[."] > # 12 678 93.45 "foo"
To recap, you can combine any number of:
Unicode properties like :punct that start with a : and correspond to some character property specified by Unicode; or
[...] character classes that enumerate specific characters, backslash character classes (like \d), or character ranges (eg a..z).
If an overall <...> assertion is to be a character class arithmetic expression then the first character after the opening < must be one of four characters:
: introducing a Unicode property (eg <:punct ...>);
[ introducing a [...] character class (eg <[abc ...>);
+ or a -. This may be followed by spaces. It must then be followed by either a Unicode property (:foo) or a [...] character class (eg <+ :punct ...>).
Thereafter each additional property or character class in the same overall character class arithmetic expression must be preceded by a + or - with or without additional spaces (eg <:punct - [."] ...>).
You can group sub-expressions in parentheses.
I'm not sure what the precise semantics of + and - are. I note this surprising result:
say $a.subst(/<-[."] +:punct>/, " ", :g); # substitutes ALL characters!?!
Built ins of the form <...> are not accepted in character class arithmetic expressions.
This is true even if they're called "character classes" in the doc. This includes ones that are nothing like a character class (eg <ident> is called a character class in the doc even though it matches a string of multiple characters which string matches a particular pattern!) but also ones that seem like they are character classes like <punct> or <digit>. (Many of these latter correspond directly to Unicode properties so you just use those instead.)
To use a backslash "character class" like \d in a character class arithmetic expression using + and - arithmetic you must list it within a [...] character class.
Combining assertions
While <punct> can't be combined with other assertions using character class arithmetic it can be combined with other regex constructs using the & regex conjunction operator:
<punct> & <-[."]> # +$12 678 93.45 "foo"
Depending on the state of compiler optimization (and as of 2019 there's been almost no effort applied to the regex engine), this will be slower in general than using real character classes.
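For comparison with the C++ question at the top of this page: std::regex has no character class arithmetic, but a negative lookahead placed in front of a POSIX class gives a similar "punct minus . and "" effect. This is only a sketch, assuming the default "C" locale, in which [[:punct:]] covers the ASCII punctuation characters (including + and $, unlike the Unicode property above). The function name is made up for the example.

```cpp
#include <regex>
#include <string>

// Replaces every ASCII punctuation character except '.' and '"' with a space.
// (?![."]) rejects the two excluded characters before [[:punct:]] is tried,
// which emulates the subtraction "punct - [."]".
std::string blank_punct_except_dot_and_quote(const std::string& text) {
    static const std::regex re(R"re((?![."])[[:punct:]])re");
    return std::regex_replace(text, re, " ");
}
```

Each replaced character becomes exactly one space, so the result has the same length as the input.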

EBNF for capturing a comma between two optional values

I have two optional values, and when both are present, a comma needs to be in between them. If one or both values are present, there may be a trailing comma, but if no values are present, no comma is allowed.
Valid examples:
(first,second,)
(first,second)
(first,)
(first)
(second,)
(second)
()
Invalid examples:
(first,first,)
(first,first)
(second,second,)
(second,second)
(second,first,)
(second,first)
(,first,second,)
(,first,second)
(,first,)
(,first)
(,second,)
(,second)
(,)
(,first,first,)
(,first,first)
(,second,second,)
(,second,second)
(,second,first,)
(,second,first)
I have EBNF code (XML-flavored) that suffices, but is there a way I can simplify it? I would like to make it more readable / less repetitive.
tuple ::= "(" ( ( "first" | "second" | "first" "," "second" ) ","? )? ")"
If it’s easier to understand in regex, here’s the equivalent code, but I need a solution in EBNF.
/\(((first|second|first\,second)\,?)?\)/
And here’s a helpful railroad diagram:
This question becomes even more complex when we abstract it to three terms: "first", "second", and "third" are all optional, but they must appear in that order, separated by commas, with an optional trailing comma. The best I can come up with is a brute-force method:
"(" (("first" | "second" | "third" | "first" "," "second" | "first" "," "third" | "second" "," "third" | "first" "," "second" "," "third") ","?)? ")"
Clearly, a solution involving O(2^n) complexity is not very desirable.
I found a way to simplify it, but not by much:
"(" ( ("first" ("," "second")? | "second") ","? )? ")"
For the three-term solution, take the two-term solution and prepend a first term:
"(" (("first" ("," ("second" ("," "third")? | "third"))? | "second" ("," "third")? | "third") ","?)? ")"
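For any (n+1)-term solution, take the n-term solution and prepend a first term. If each n-term solution gets its own named rule, the grammar grows as O(n), which is significantly better than O(2^n); written inline as above, the n-term solution is repeated twice at every step, so the expression still doubles in size.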
This expression might help you design a better one. You can do this using only capturing groups, sweeping from left to right across your possible inputs, maybe similar to this:
\((first|second|)(,|)(second|)([\)|,]+)
I'm just guessing that you wish to capture the middle comma:
This may not be the exact expression you want. However, it might show you how this might be done in a simple way:
^(?!\(,)\((first|)(,|)(second|)([\)|,]+)$
You can add more boundaries to the left and right of your expression, maybe similar to this expression:
This graph shows how the second expression would work:
Performance
This JavaScript snippet shows the performance of the second expression using a simple 1-million times for loop, and how it captures first and second using $1 and $3.
const repeat = 1000000;
const start = Date.now();
let match;
for (let i = repeat; i > 0; i--) {
    const string = "(first,second,)";
    const regex = /^(?!\(,)\((first|second|)(,|)(second|)([\)|,]+)$/gms;
    match = string.replace(regex, "$1 and $3");
}
const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
I'm not familiar with EBNF but I am familiar with BNF and parser grammars. The following is just a variation of what you have based on my own regex. I am assuming the unquoted parens are not considered tokens and are used to group related elements.
tuple ::= ( "(" ( "first,second" | "first" | "second" ) ","? ")" ) | "()"
It matches on either (first,second or (first or (second
Then it matches on an optional ,
Followed by a closing parens. )
or the empty parens grouping. ()
But I doubt this is an improvement.
Here is my Java test code. The first two lines of strings in the test data match. The others do not.
// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String[] testdata = {
    "(first,second,)", "(first,second)", "(first,)", "(first)",
    "(second,)", "(second)", "()",
    "(first,first,)", "(first,first)", "(second,second,)",
    "(second,second)", "(second,first,)", "(second,first)",
    "(,first,second,)", "(,first,second)", "(,first,)", "(,first)",
    "(,second,)", "(,second)", "(,)", "(,first,first,)",
    "(,first,first)", "(,second,second,)", "(,second,second)",
    "(,second,first,)", "(,second,first)"
};
String reg = "\\(((first,second)|first|second),?\\)|\\(\\)";
Pattern p = Pattern.compile(reg);
for (String t : testdata) {
    Matcher m = p.matcher(t);
    // matches() requires the whole string to match, not just a prefix
    if (m.matches()) {
        System.out.println(t);
    }
}
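The ordering rule from the question (terms in a fixed order, separated by commas, with one optional trailing comma) can also be checked in a single left-to-right pass instead of being enumerated in a grammar. The following is a minimal sketch of that idea, not an EBNF answer; the function name valid_tuple and the choice to pass the ordered term list as a vector are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if `s` is "(" + an ordered subset of `terms` joined by commas
// + an optional trailing comma (only allowed when a term is present) + ")".
bool valid_tuple(const std::string& s, const std::vector<std::string>& terms) {
    if (s.size() < 2 || s.front() != '(' || s.back() != ')') return false;
    std::string body = s.substr(1, s.size() - 2);
    if (body.empty()) return true;            // "()"
    if (body.back() == ',') body.pop_back();  // drop one trailing comma
    std::size_t next_term = 0;  // first index in `terms` still allowed
    std::size_t pos = 0;        // start of the current item in `body`
    for (;;) {
        const std::size_t comma = body.find(',', pos);
        const std::string item = body.substr(
            pos, comma == std::string::npos ? std::string::npos : comma - pos);
        // Look for the item among the terms that may still appear.
        std::size_t i = next_term;
        while (i < terms.size() && terms[i] != item) ++i;
        if (i == terms.size()) return false;  // unknown, repeated or out of order
        next_term = i + 1;
        if (comma == std::string::npos) return true;
        pos = comma + 1;
    }
}
```

Because next_term only ever moves forward, the work grows linearly with the number of terms plus the input length, and adding "third" only means appending it to the vector.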

Regex for matching C++ string constant

I'm currently working on a C++ preprocessor and I need to match string constants with more than 0 letters, like this: "hey I'm a string".
I'm currently working with this one here \"([^\\\"]+|\\.)+\" but it fails on one of my test cases.
Test cases:
std::cout << "hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
Expected output:
std::cout << String("hello") << String(" world");
std::cout << String("He said: \"bananas\"") << String("...");
std::cout << "";
std::cout << String("\x12\23\x34");
On the second one I instead get
std::cout << String("He said: \")bananas\"String(" << ")...";
Short repro code (using the regex by AR.3):
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
in_line = std::regex_replace(in_line, r, "String($&)");
Lexing a source file is a good job for regexes. But for such a task, let's use a better regex engine than std::regex. Let's use PCRE (or boost::regex) at first. At the end of this post, I'll show what you can do with a less feature-packed engine.
We only need to do partial lexing, ignoring all unrecognized tokens that won't affect string literals. What we need to handle is:
Singleline comments
Multiline comments
Character literals
String literals
We'll be using the extended (x) option, which ignores whitespace in the pattern.
Comments
Here's what [lex.comment] says:
The characters /* start a comment, which terminates with the characters */. These comments do not nest.
The characters // start a comment, which terminates immediately before the next new-line character. If
there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear
between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment
characters //, /*, and */ have no special meaning within a // comment and are treated just like other
characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.
— end note ]
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
Easy peasy. If you match anything there, just (*SKIP)(*FAIL) - meaning that you throw away the match. The (?s: .*? ) applies the s (singleline) modifier to the . metacharacter, meaning it's allowed to match newlines.
Character literals
Here's the grammar from [lex.ccon]:
character-literal:
encoding-prefix(opt) ' c-char-sequence '
encoding-prefix:
one of u8 u U L
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name
escape-sequence:
simple-escape-sequence
octal-escape-sequence
hexadecimal-escape-sequence
simple-escape-sequence: one of \' \" \? \\ \a \b \f \n \r \t \v
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
hexadecimal-escape-sequence:
\x hexadecimal-digit
hexadecimal-escape-sequence hexadecimal-digit
Let's define a few things first, which we'll need later on:
(?(DEFINE)
(?<prefix> (?:u8?|U|L)? )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
prefix is defined as an optional u8, u, U or L
escape is defined as per the standard, except that I've merged universal-character-name into it for the sake of simplicity
Once we have these, a character literal is pretty simple:
(?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
We throw it away with (*SKIP)(*FAIL)
Simple strings
They're defined in almost the same way as character literals. Here's a part of [lex.string]:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
encoding-prefix(opt) R raw-string
s-char-sequence:
s-char
s-char-sequence s-char
s-char:
any member of the source character set except the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name
This will mirror the character literals:
(?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
The differences are:
The character sequence is optional this time (* instead of +)
The double quote is disallowed when unescaped instead of the single quote
We actually don't throw it away :)
Raw strings
Here's the raw string part:
raw-string:
" d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
r-char-sequence:
r-char
r-char-sequence r-char
r-char:
any member of the source character set, except a right parenthesis )
followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
d-char-sequence:
d-char
d-char-sequence d-char
d-char:
any member of the basic source character set except:
space, the left parenthesis (, the right parenthesis ), the backslash \,
and the control characters representing horizontal tab,
vertical tab, form feed, and newline.
The regex for this is:
(?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
[^ ()\\\t\x0B\r\n]* is the set of characters that are allowed in delimiters (d-char)
\k<delimiter> refers to the previously matched delimiter
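The delimiter backreference does not strictly need named groups; a numbered backreference works as well, and std::regex supports those in its default ECMAScript grammar. Below is a hedged sketch of just the raw-string part under that assumption. The function name is hypothetical, the encoding prefix is left out for brevity, and \v stands in for \x0B.

```cpp
#include <regex>
#include <string>

// Matches a whole raw string literal such as R"tag(...)tag" and extracts the
// text between the parentheses. Group 1 is the d-char-sequence; \1 requires
// the same sequence again before the closing quote.
bool raw_string_body(const std::string& literal, std::string& body) {
    static const std::regex re(
        R"re(R"([^ ()\\\t\v\r\n]*)\(((?:.|[\r\n])*?)\)\1")re");
    std::smatch m;
    if (!std::regex_match(literal, m, re)) return false;
    body = m[2].str();
    return true;
}
```

The (?:.|[\r\n])*? group plays the role of (?s:.*?), since the ECMAScript dot does not match line terminators.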
The full pattern
The full pattern is:
(?(DEFINE)
(?<prefix> (?:u8?|U|L)? )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
# character literal
| (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
# standard string
| (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
# raw string
| (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
See the demo here.
boost::regex
Here's a simple demo program using boost::regex:
#include <string>
#include <iostream>
#include <boost/regex.hpp>
static void test()
{
boost::regex re(R"regex(
(?(DEFINE)
(?<prefix> (?:u8?|U|L) )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
# character literal
| (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
# standard string
| (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "
# raw string
| (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
)regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);
std::string subject(R"subject(
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";
"" // empty string
'"' // character literal
// this is "a string literal" in a comment
/* this is
"also inside"
//a comment */
// and this /*
"is not in a comment"
// */
"this is a /* string */ with nested // comments"
)subject");
std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
}
int main(int argc, char **argv)
{
try
{
test();
}
catch(const std::exception& ex)
{
std::cerr << ex.what() << std::endl;
}
return 0;
}
(I left syntax highlighting disabled because it goes nuts on this code)
For some reason, I had to take the ? quantifier out of prefix (change (?<prefix> (?:u8?|U|L)? ) to (?<prefix> (?:u8?|U|L) ) and (?&prefix) to (?&prefix)?) to make the pattern work. I believe it's a bug in boost::regex, as both PCRE and Perl work just fine with the original pattern.
What if we don't have a fancy regex engine at hand?
Note that while this pattern technically uses recursion, it never nests recursive calls. Recursion could be avoided by inlining the relevant reusable parts into the main pattern.
A couple of other constructs can be avoided at the price of reduced performance. We can safely replace the atomic groups (?>...) with normal groups (?:...) as long as we don't nest quantifiers, which is what keeps catastrophic backtracking away.
We can also avoid (*SKIP)(*FAIL) if we add one line of logic into the replacement function: All the alternatives to skip are grouped in a capturing group. If the capturing group matched, just ignore the match. If not, then it's a string literal.
All of this means we can implement this in JavaScript, which has one of the simplest regex engines you can find, at the price of breaking the DRY rule and making the pattern illegible. The regex becomes this monstrosity once converted:
(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"
And here's an interactive demo you can play with:
function run() {
var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;
var input = document.getElementById("input").value;
var output = input.replace(re, function(m, ignore) {
return ignore ? m : "String(" + m + ")";
});
document.getElementById("output").innerText = output;
}
document.getElementById("input").addEventListener("input", run);
run();
<h2>Input:</h2>
<textarea id="input" style="width: 100%; height: 50px;">
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";
"" // empty string
'"' // character literal
// this is "a string literal" in a comment
/* this is
"also inside"
//a comment */
// and this /*
"is not in a comment"
// */
"this is a /* string */ with nested // comments"
</textarea>
<h2>Output:</h2>
<pre id="output"></pre>
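The same capture-group trick carries over to std::regex, which has no replacement callback but can be driven with std::sregex_iterator. The sketch below is a simplified take rather than a port of the full pattern: prefixes and raw strings are omitted, any backslash-plus-character pair is accepted as an escape, and empty literals are left alone as in the question's expected output. The function name is an assumption for the example.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Wraps non-empty string literals in String(...). Comments and character
// literals are matched by capture group 1 and copied through unchanged.
std::string wrap_string_literals(const std::string& src) {
    static const std::regex re(
        R"re((//.*|/\*(?:[^*]|\*+[^*/])*\*+/|'(?:\\.|[^'\\\r\n])+')|"(?:\\.|[^"\\\r\n])*")re");
    std::string out;
    std::size_t last = 0;  // end of the previously copied region
    for (std::sregex_iterator it(src.begin(), src.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position());
        out.append(src, last, pos - last);
        // Skip if group 1 matched (comment or char literal) or the literal is "".
        const bool skip = m[1].matched || m.length() <= 2;
        out += skip ? m.str() : "String(" + m.str() + ")";
        last = pos + static_cast<std::size_t>(m.length());
    }
    out.append(src, last, std::string::npos);
    return out;
}
```

The multiline comment alternative uses the classic /\*(?:[^*]|\*+[^*/])*\*+/ form instead of a lazy [\s\S]*?, which avoids relying on lazy quantifiers across newlines.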
Regular expressions can be tricky for beginners, but once you understand their basics and a well-tested divide-and-conquer strategy, they will be your go-to tool.
What you need is to search for a quote (") that is not preceded by a backslash (\) and read all characters up to the next such quote.
The regex I came up with is (".*?[^\\]"). See the code snippet below.
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex re(R"((".*?[^\\]"))");
in_line = std::regex_replace(in_line, re, "String($1)");
std::cout << in_line << std::endl;
Output:
std::cout << String("He said: \"bananas\"") << String("...");
Regex Explanation:
(".*?[^\\]")
Options: Case sensitive; Numbered capture; Allow zero-length matches; Regex syntax only
Match the regex below and capture its match into backreference number 1 (".*?[^\\]")
Match the character “"” literally "
Match any single character that is NOT a line break character (line feed, carriage return) .*?
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
Match any character that is NOT the backslash character [^\\]
Match the character “"” literally "
String($1)
Insert the character string “String” literally String
Insert an opening parenthesis (
Insert the text that was last matched by capturing group number 1 $1
Insert a closing parenthesis )
Read the relevant sections of the C++ standard; they are called lex.ccon and lex.string.
Then convert each rule you find there into a regular expression (if you really want to use regular expressions; it might turn out that they are not capable of doing this job).
Then, build more complicated regular expressions out of them. Be sure to name your regular expressions exactly as the rules from the C++ standard, so that you can recheck them later.
If, instead of using regular expressions, you want to use an existing tool, here is one: http://clang.llvm.org/doxygen/Lexer_8cpp_source.html. Have a look at the LexStringLiteral function.
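One way to follow the "name your regular expressions after the rules" advice is to keep each grammar rule as a string fragment and concatenate the fragments into larger ones. This is a sketch assuming std::regex with its default ECMAScript grammar; it covers only ordinary string literals without an encoding prefix or universal-character-names, and the variable names merely mirror the rule names from the standard.

```cpp
#include <regex>
#include <string>

// One fragment per grammar rule, named after the rule it implements.
const std::string simple_escape_sequence = R"re(\\['"?\\abfnrtv])re";
const std::string octal_escape_sequence = R"re(\\[0-7]{1,3})re";
const std::string hexadecimal_escape_sequence = R"re(\\x[0-9a-fA-F]+)re";
const std::string escape_sequence = "(?:" + simple_escape_sequence + "|" +
                                    octal_escape_sequence + "|" +
                                    hexadecimal_escape_sequence + ")";
// s-char: anything except double quote, backslash and new-line, or an escape.
const std::string s_char = "(?:[^\"\\\\\\r\\n]|" + escape_sequence + ")";
const std::string string_literal = "\"" + s_char + "*\"";

// True if `text` is, in its entirety, one ordinary string literal.
bool is_string_literal(const std::string& text) {
    static const std::regex re(string_literal);
    return std::regex_match(text, re);
}
```

Keeping the fragments separate makes it easy to recheck each one against its rule, at the cost of building the final pattern at run time.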

Parse arbitrary delimiter character using Antlr4

I'm trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily add the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all different variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~@#]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter that equals the first character (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
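What the two predicates do can be pictured as a small hand-written scanner: remember the opening character, consume atoms while the next character differs from it, then require it as the closer. The C++ sketch below mirrors that logic outside of ANTLR. The function name and the delimiter set (/, ~, @, #) are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Returns the length of the delimited token starting at `start`, or 0 if no
// token starts there. An atom is either a backslash plus any character, or a
// single character other than a backslash.
std::size_t match_delimited(const std::string& in, std::size_t start) {
    if (start >= in.size()) return 0;
    const char delim = in[start];
    if (delim != '/' && delim != '~' && delim != '@' && delim != '#') return 0;
    std::size_t i = start + 1;
    std::size_t atoms = 0;
    while (i < in.size() && in[i] != delim) {  // "delimiter not ahead"
        if (in[i] == '\\') {
            if (i + 1 >= in.size()) return 0;  // dangling backslash
            i += 2;                            // escaped pair counts as one atom
        } else {
            i += 1;
        }
        ++atoms;
    }
    // Need at least one atom and an actual closing delimiter.
    if (i >= in.size() || atoms == 0) return 0;
    return i - start + 1;
}
```

On the sample input from the Java class above, this accepts /foo/ and ~\~~ and rejects /bar\ ~\~~ when scanning from its first slash, which agrees with the token listing.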
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
@lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~@#]
;
fragment SPACES
: [ \t]+
;

regex remove punct removes non-punctuation characters in R

While filtering and cleaning text in Hebrew, I found that
gsub("[[:punct:]]", "", txt)
actually removes a relevant character. The character is "ק" and it is located in the "E" spot on the keyboard. Interestingly, the gsub function in R removes the "ק" character and then all words get messed up. Does anyone have an idea why?
According to Regular Expressions as used in R:
Certain named classes of characters are predefined. Their
interpretation depends on the locale (see locales); the interpretation
below is that of the POSIX locale.
According to the POSIX locale, [[:punct:]] should capture ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. So, you might need to adjust your regex to remove only the characters you want:
txt <- "!\"#$%&'()*+,\\-./:;<=>?@[\\\\^\\]_`{|}~"
gsub("[\\\\!\"#$%&'()*+,./:;<=>?@[\\^\\]_`{|}~-]", "", txt, perl = TRUE)
Sample program output:
[1] ""
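The same "list the characters explicitly" idea avoids locale-dependent class definitions in other languages too. Here is a sketch in C++, under the assumption that the input is UTF-8: every byte of a multi-byte character such as ק has its high bit set, so it can never equal one of the 32 ASCII punctuation bytes. The function name is made up for the example.

```cpp
#include <algorithm>
#include <string>

// Removes only the 32 ASCII punctuation characters, independent of locale.
// Bytes >= 0x80 (parts of UTF-8 multi-byte characters) are never touched.
std::string strip_ascii_punct(std::string text) {
    static const std::string punct = R"p(!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~)p";
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](char c) {
                                  return punct.find(c) != std::string::npos;
                              }),
               text.end());
    return text;
}
```

Applied to the UTF-8 bytes of ק followed by "!", this keeps the two bytes of the letter and drops only the exclamation mark.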