Convert a Perl regular expression to an equivalent ECMAScript regular expression - C++

I'm currently using VC++ 2010, but its syntax_option_type only contains the following options:
static const flag_type icase = regex_constants::icase;
static const flag_type nosubs = regex_constants::nosubs;
static const flag_type optimize = regex_constants::optimize;
static const flag_type collate = regex_constants::collate;
static const flag_type ECMAScript = regex_constants::ECMAScript;
static const flag_type basic = regex_constants::basic;
static const flag_type extended = regex_constants::extended;
static const flag_type awk = regex_constants::awk;
static const flag_type grep = regex_constants::grep;
static const flag_type egrep = regex_constants::egrep;
It doesn't contain perl_syntax_group (the Boost library has that option). However, I don't want to use the Boost library.
There are many regular expressions written in Perl, so I want to convert the existing Perl regular expressions to ECMAScript (or any grammar that VC++ 2010 supports). After the conversion I can use the equivalent regular expressions directly in VC++ 2010 without using a third-party library.
One example:
const boost::tregex e(__T("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"));
const CString human_format = __T("$1-$2-$3-$4");

CString human_readable_card_number(const CString& s)
{
    return boost::regex_replace(s, e, human_format);
}

CString credit_card_number = "1234567887654321";
credit_card_number = human_readable_card_number(credit_card_number);
assert(credit_card_number == "1234-5678-8765-4321");
In the above example, what I want to do is convert e and human_format to ECMAScript-style expressions.
Is it possible to find a general way to convert all Perl regular expressions to ECMAScript style?
Are there some tools to do this?
Any help will be appreciated!

For the particular regex you want to convert, the equivalent in ECMA regex is:
/^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$/
In this case, \A in Perl regex has the same meaning as ^ in ECMA regex (matching the beginning of the string), and \z in Perl regex has the same meaning as $ in ECMA regex (matching the very end of the string). Note that the meaning of ^ and $ in ECMA regex changes to matching the beginning and end of each line if you enable multiline mode.
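For completeness, here is a minimal sketch of that conversion dropped into the <regex> header that ships with VC++ 2010. It uses std::wstring instead of CString purely to keep the example self-contained; wiring it back to CString is left out.

#include <cassert>
#include <regex>
#include <string>

std::wstring human_readable_card_number(const std::wstring& s)
{
    // ^ and $ take the place of Perl's \A and \z; multiline mode is off by
    // default, so they anchor to the whole string.
    const std::wregex e(L"^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");
    const std::wstring fmt(L"$1-$2-$3-$4");
    return std::regex_replace(s, e, fmt);
}

int main()
{
    std::wstring card = L"1234567887654321";
    card = human_readable_card_number(card);
    assert(card == L"1234-5678-8765-4321");
}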
ECMA regex is roughly a subset of Perl regex, so if a regex uses Perl-exclusive features, it is likely not convertible to ECMA regex. Even where the syntax is the same, a construct may mean something slightly different in the two dialects, so it is always wise to check the documentation and compare the usage.
I'm mainly going to cover what is similar between ECMA regex and Perl regex. Where something is not the same but still convertible, I will mention it to the best of my ability.
ECMA regex lacks features for working with Unicode, which compels you to look up code points and spell them out in character classes yourself.
Going by the documentation for Perl regular expressions:
Modifiers:
Only i, g, and m exist in the ECMA standard, and they behave the same as in Perl.
The s (dot-all) modifier can be simulated in ECMA regex by replacing . with a pair of complementary character classes, e.g. [\S\s] or [\D\d] (see the sketch after this list).
There is no support in any way for the x and p flags.
I don't know if there is any way to simulate the rest (prefix and suffix modifiers).
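As a quick illustration of the s-flag workaround above, a minimal sketch (assuming VC++ 2010's <regex> with the default ECMAScript grammar):

#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string s = "first line\nsecond line";

    std::regex dot("first.*second");           // . does not match the newline
    std::regex dotall("first[\\s\\S]*second"); // [\s\S] matches any character

    std::cout << std::regex_search(s, dot)    << "\n"; // prints 0
    std::cout << std::regex_search(s, dotall) << "\n"; // prints 1
}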
Meta characters:
I have a bit of doubt about using \ with a non-meta character that doesn't resolve to any special meaning, but it should be fine as long as you don't escape where you don't need to. The . in ECMA regex excludes a few more characters (all line terminators, not just \n). The rest behaves the same in ECMA regex (including the effect of the m flag on ^ and $).
Quantifier:
Greedy and lazy behavior should be the same. There is no possessive behavior in ECMA regex (see the sketch below for one way to emulate it).
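One commonly cited workaround for possessive quantifiers (my own addition, so treat it as an assumption): wrap the quantified part in a look-ahead that captures it, then consume the capture with a back reference. This relies on ECMA look-aheads not being re-entered on backtracking. A sketch:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string s = "aaab";

    std::regex greedy("a+ab");            // a+ gives back one 'a', so this matches
    std::regex emulated("(?=(a+))\\1ab"); // emulates Perl's a++ab

    std::cout << std::regex_search(s, greedy)   << "\n"; // expected 1
    std::cout << std::regex_search(s, emulated) << "\n"; // expected 0: no backtracking into a+
}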
Escape sequences:
There are no \a and \e in ECMA regex. \t, \n, \r, and \f are the same.
Check the documentation if the regex has \cX - there are differences.
\xhh is common to ECMA regex and Perl regex (specifying exactly 2 hexadecimal digits is the safest - otherwise, you will have to look up the documentation to see how each language deals with fewer than 2 hexadecimal digits).
\uhhhh is an ECMA-regex-exclusive feature for specifying a Unicode character. Perl has its own exclusive ways to specify characters, such as \x{}, \N{}, \o{}, and \000.
\l, \u, \L, \U are exclusive to Perl regex.
\Q and \E can be simulated by escaping the quoted section by hand (see the sketch after this list).
Octal escapes (with fewer than 3 octal digits) in Perl regex may be confusing. Check the context carefully, read the documentation, and/or test the regex to make sure you understand what it is doing, since it might be either an escape sequence or a back reference.
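For the \Q...\E item above, here is a minimal hand-rolled quotemeta-style helper (my own sketch, not a standard function) that escapes the usual ECMA metacharacters:

#include <iostream>
#include <regex>
#include <string>

std::string escape_regex(const std::string& s)
{
    static const std::string meta = "\\^$.|?*+()[]{}";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        if (meta.find(s[i]) != std::string::npos)
            out += '\\';   // prefix every metacharacter with a backslash
        out += s[i];
    }
    return out;
}

int main()
{
    // Perl: /\Qa.b(c)\E/ -- here the literal text is escaped by hand instead.
    std::regex re(escape_regex("a.b(c)"));
    std::cout << std::regex_search(std::string("xx a.b(c) yy"), re) << "\n"; // prints 1
}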
Character classes and other special escapes:
\w, \W, \s, \S, \d, \D are equivalent in ECMA regex and Perl regex, assuming US-ASCII. If Unicode is involved, things will be a bloody mess.
There are no POSIX character classes in ECMA regex. Use \w, \s, \d above, or spell the class out yourself in a bracket expression.
Back references are mostly the same, but I don't know whether either dialect allows back references beyond \9.
Named references can be simulated with plain numbered back references (see the sketch after this list).
The rest (except [] and the escape sequences already mentioned) are unsupported in ECMA regex.
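To illustrate the named-reference item above, a small sketch of a Perl pattern with a named group rewritten against plain numbered groups (the pattern and input are invented for the example):

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Perl: /(?<word>\w+) \k<word>/  -- finds a doubled word.
    // ECMA rewrite: the named group becomes group 1, \k<word> becomes \1.
    std::regex doubled("(\\w+) \\1");

    std::smatch m;
    const std::string s = "this is is a test";
    if (std::regex_search(s, m, doubled))
        std::cout << "doubled word: " << m[1] << "\n"; // prints "is"
}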
Assertion:
\b and \B are equivalent in both languages, with regard to how they are defined in terms of \w.
Capture groups: Grouping with () and back references are the same. $n, used in the replacement string to refer back to the matched text, is also the same. The rest of that section covers Perl-exclusive features.
Quoting meta-characters: (Content already mentioned in previous sections).
Extended Pattern:
ECMA regex doesn't support changing flags inside the regex. Depending on which flags are involved, you may be able to rewrite the regex (the s flag is one that can always be converted to an equivalent expression in ECMA regex, as sketched under Modifiers above).
Only (?:pattern) (non-capturing group), (?=pattern) (positive look ahead), (?!pattern) (negative look ahead) are common between Perl and ECMA.
There are no comments in ECMA regex, so (?#text) can simply be dropped.
Look-behinds are not supported in ECMA regex (Perl supports fixed-width look-behind). In some cases, a regex with a positive look-behind written in Perl can be converted to ECMA regex by turning the look-behind into a capturing group (see the sketch after this list).
As mentioned before, a named group can be converted to a normal capture group and referred to with a numbered back reference.
The rest are Perl exclusive features.
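Here is a sketch of the look-behind rewrite mentioned above (the pattern is an invented example): the Perl regex /(?<=USD )\d+/ matches only the digits; in ECMA regex you match the prefix as well and pull the digits out of a capture group.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::regex re("USD (\\d+)");   // prefix is matched, digits are captured
    std::smatch m;
    const std::string s = "total: USD 1999 net";
    if (std::regex_search(s, m, re))
        std::cout << m[1] << "\n"; // prints 1999
}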
Special Backtracking Control Verbs: These are Perl-exclusive, and I have no idea what they do (I've never touched them), let alone how to convert them. Most likely they are not convertible anyway.
Conclusion:
If a regex utilizes the full power of Perl regex, or features at the level the Boost library supports (e.g. recursive regexes), it is not possible to convert it to ECMA regex. Fortunately, ECMA regex covers the most commonly used features, so it's likely that your regexes are convertible.
Reference:
ECMA RegExp Reference on MDN

Related

Transform Regexp to POSIX BRE

I'd like to put this expression into POSIX BRE.
HTTP\/[\d.]+.\s+(?:403)\s+(4[0-9])\s+
Here is what I've come up with so far.
HTTP\/[0-9.]{1,}.[[:blank:]]{1,}403[[:blank:]]{1,}(4[0-9])[[:blank:]]
Using a web based regex checker, both examples work quite well.
This regexp needs to be registered in SCOM, however, and it seems SCOM only supports POSIX BRE for monitoring Linux servers.
Here's the POSIX documentation on Basic Regular Expressions. In particular, note:
When a BRE matching a single character, a subexpression, or a back-reference is followed by an interval expression of the format \{m\}, \{m,\}, or \{m,n\}, together with that interval expression it shall match what repeated consecutive occurrences of the BRE would match…
So [[:blank:]]{1,} isn't going to do what you think it will; the braces need to be preceded with backslashes.
On the other hand, most BRE implementations do allow you to use \+ to mean "one or more repetitions". At least, the BSD and GNU varieties do. So you might well be able to write that as [[:blank:]]\+ instead of using the numeric repetition operator [[:blank:]]\{1,\}.
Finally, [[:blank:]] might not be what you want. At least, it doesn't match the same thing as \s does. [[:blank:]] matches only space and tab characters ([ \t]). But in most regex libraries, \s is the same as [ \t\r\n\f\v], which is what is matched by [[:space:]] in a C regex (or by the isspace() function in C code). The most visible difference between [[:blank:]] and \s (or [[:space:]]) is that [[:blank:]] does not match newlines. Perhaps that's fine in your application.
Pedantic note: Some regex libraries define \s as [ \t\r\n\f], but you're unlikely to notice the difference. And all of those lists of characters assume that the regex has been compiled in the "C" locale. If the regex library is locale-aware and some other locale has been enabled, additional characters might match.
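Since this compilation started from the std::regex grammars in VC++, here is a small sketch of the [[:blank:]] vs [[:space:]] difference using std::regex's POSIX "extended" grammar (any POSIX-style engine should behave the same way):

#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string s = "a\nb";   // the only whitespace is a newline

    std::regex blank("[[:blank:]]", std::regex::extended);
    std::regex space("[[:space:]]", std::regex::extended);

    std::cout << std::regex_search(s, blank) << "\n"; // prints 0: no space or tab
    std::cout << std::regex_search(s, space) << "\n"; // prints 1: newline is whitespace
}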

Sed escaping special chars

To make sed work with an alternation construct, we must escape special chars like ( or |:
sed -n "/\(abc\|def\)/p"
The simple
sed -n "/(abc|def)/p"
doesn't work.
My question is: why does sed behave contrary to "normal" regexes, where we escape special chars to give them a literal meaning?
What you call "normal" is a feature invented by Perl.
All traditional regex engines (e.g. the ones used by ed, grep, sed, and emacs) have some special characters that match literally when escaped, and normal characters that get a special meaning when escaped. My best guess for why this happened is evolution: maybe the first implementation of regexes only supported [, ], and *, and everything else was matched literally. To introduce new features while keeping compatibility, the escaped syntax (\(, \), etc.) was invented.
Later on, other tools just copied the existing syntax.
As far as I know, Perl was the first language to make regex syntax more, well, regular:
All alphanumeric characters match themselves.
Escaping an alphanumeric character may have a special meaning (e.g. \n, \1, \z).
Punctuation characters may have a special meaning (e.g. (, +, ?).
Escaping a non-alphanumeric character always makes it match literally, even if it wasn't special in the first place (e.g. \:, \").
All "modern" regex engines (e.g. the ones used in JavaScript or .NET) copied Perl's behavior.

Is plus (+) part of basic regular expressions?

Recently I was told that + (one or more occurrences of the previous pattern/character) is not part of basic regex, not even when written as \+.
It was on a question about maximum compatibility.
I was under the impression that ...
echo "Hello World, I am an example-text" | sed 's#[^a-z0-9]\+#.#ig'
... always results in:
Hello.World.I.am.an.example.text
But then I was told that "it replaces every character not lowercase or a digit followed by + " and that it is the same as [^a-z0-9][+].
So my real question: is there any regex definition or implementation that does not treat either x+ or x\+ the same as xx*?
POSIX "basic" regular expressions do not support + (nor ?!). Most implementations of sed add support for \+ but it's not a POSIX standard feature. If your goal is maximum portability you should avoid using it. Notice that you have to use \+ rather than the more common +.
echo "Hello World, I am an example-text" | sed 's#[^a-z0-9]\+#.#ig'
The -E flag enables "extended" regular expressions, which are a lot closer to the syntax used in Perl, JavaScript, and most other modern regex engines. With -E you don't need to have a backslash; it's simply +.
echo "Hello World, I am an example-text" | sed -E 's#[^a-z0-9]+#.#ig'
From https://www.regular-expressions.info/posix.html:
POSIX or "Portable Operating System Interface for uniX" is a collection of standards that define some of the functionality that a (UNIX) operating system should support. One of these standards defines two flavors of regular expressions. Commands involving regular expressions, such as grep and egrep, implement these flavors on POSIX-compliant UNIX systems. Several database systems also use POSIX regular expressions.
The Basic Regular Expressions or BRE flavor standardizes a flavor similar to the one used by the traditional UNIX grep command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its flavor. Most other flavors, including POSIX ERE, use a backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Shorthands are not supported. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Some implementations support \? and \+ as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not part of the POSIX standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there's no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
POSIX BRE does not support any other features. Even alternation is not supported.
(Emphasis mine.)
So my real question: is there any regex definition or implementation that does not treat either x+ or x\+ the same as xx*?
I can't think of any real world language or tool that supports neither + nor \+.
In the formal mathematical definition of regular expressions there are commonly only three operations defined:
Concatenation: AB matches A followed by B.
Alternation: A|B matches either A or B.
Kleene star: R* matches 0 or more repetitions of R.
These three operations are enough to give the full expressive power of regular expressions†. Operators like ? and + are convenient in programming but not necessary in a mathematical context. If needed, they are defined in terms of the others: R? is R|ε and R+ is RR*.
† Mathematically speaking, that is. Features like back references and lookahead/lookbehind don't exist in formal language theory. Those features add additional expressive power not available in mathematical definitions of regular expressions.
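A small sketch of both points, using std::regex's "basic" (BRE) grammar from the first question: + is an ordinary character there, and the portable spellings of "one or more x" are xx* or x\{1,\}.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string s = "xxx";

    std::cout << std::regex_match(s, std::regex("x+", std::regex::basic))        << "\n"; // prints 0: the pattern means literal "x+"
    std::cout << std::regex_match(s, std::regex("xx*", std::regex::basic))       << "\n"; // prints 1
    std::cout << std::regex_match(s, std::regex("x\\{1,\\}", std::regex::basic)) << "\n"; // prints 1
}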
In some traditional sed implementations, you have to enable "extended" regular expressions to get support for + to mean "one or more."
For evidence of this, see: sed plus sign doesn't work

How can I convert a Perl regex to work with Boost::Regex?

What is the Boost::Regex equivalent of this Perl regex for words that end with ing or ed or en?
/ing$|ed$|en$/
...
The most important difference is that regexes in C++ are strings, so all regex-specific backslash sequences (such as \w and \d) need their backslashes doubled ("\\w" and "\\d").
/^[\.:\,()\'\`-]/
should become
"^[.:,()'`-]"
The special Perl regex delimiter / doesn't exist in C++, so regexes are just a string. In those strings, you need to take care to escape backslashes correctly (\\ for every \ in your original regex). In your example, though, all those backslashes were unnecessary, so I dropped them completely.
There are other caveats; some Perl features (like variable-length lookbehind) don't exist in the Boost library, as far as I know. So it might not be possible to simply translate every regex. Your examples should be fine, though, although some of them are weird: .*[0-9].* will match any string that contains a digit somewhere, not only strings made up of digits.
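A minimal sketch of the question's pattern with Boost.Regex (assuming each word is tested on its own, which is what the $ anchors suggest):

#include <cstddef>
#include <iostream>
#include <string>
#include <boost/regex.hpp>

int main()
{
    // The Perl literal /ing$|ed$|en$/ loses its / delimiters and becomes a
    // plain C++ string; nothing in it needs extra escaping.
    const boost::regex re("ing$|ed$|en$");

    const std::string words[] = { "walking", "talked", "taken", "dog" };
    for (std::size_t i = 0; i < 4; ++i)
        std::cout << words[i] << ": " << boost::regex_search(words[i], re) << "\n";
    // walking: 1, talked: 1, taken: 1, dog: 0
}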

Regular expression opening and closing characters

When I learned regular expressions I learned they should start and end with a slash character (followed by modifiers).
For example /dog/i
However, in many examples I see them starting and ending with other characters, such as # and |.
For example |dog|
What's the difference?
This varies enormously from one regex flavor to the next. For example, JavaScript only lets you use the forward-slash (or solidus) as a delimiter for regex literals, but in Perl you can use just about any punctuation character--including, in more recent versions, non-ASCII characters like « and ». When you use characters that come in balanced pairs like braces, parentheses, or the double-arrow quotes above, they have to be properly balanced:
m«\d+»
s{foo}{bar}
Ruby also lets you choose different delimiters if you use the %r prefix, but I don't know if that extends to the balanced delimiters or non-ASCII characters. Many languages don't support regex literals at all; you just write the regexes as string literals, for example:
r'\d+' // Python
#"\d+" // C#
"\\d+" // Java
Note the double backslash in the Java version. That's necessary because the string gets processed twice: once by the Java compiler and once by the compile() method of the Pattern class. Most other languages provide a "raw" or "verbatim" form of string literal that all but eliminates such backslash-itis.
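C++ (the language the first question here is about) is in the same boat: there is no regex literal, so the pattern is an ordinary string and the backslashes get doubled, exactly as in the Java example. A minimal sketch:

#include <regex>

int main()
{
    std::regex digits("\\d+");   // the regex \d+, with the backslash doubled
    // C++11 raw string literals remove the doubling: std::regex digits2(R"(\d+)");
}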
And then there's PHP. Its preg regex functions are built on top of the PCRE library, which closely imitates Perl's regexes, including the wide variety of delimiters. However, PHP itself doesn't support regex literals, so you have to write them as if they were regex literals embedded in string literals, like so:
'/\d+/i' // match modifiers go after the slash but inside the quotes
"{\\d+}" // double-quotes may or may not require double backslashes
Finally, note that even those languages which do support regex literals don't usually offer anything like Perl's s/…/…/ construct. The closest equivalent is a function call that takes a regex literal as the first argument and a string literal as the second, like so:
s = s.replace(/foo/i, 'bar') // JavaScript
s.gsub!(/foo/i, "bar") // Ruby
Some RE engines will allow you to use a different character so as to avoid having to escape those characters when used in the RE.
For example, with sed, you can use either of:
sed 's/\/path\/to\/directory/xx/g'
sed 's?/path/to/directory?xx?g'
The latter is often more readable. The former is sometimes called "leaning toothpicks". With Perl, you can use either of:
$x =~ /#!\/usr\/bin\/perl/;
$x =~ m!#\!/usr/bin/perl!;
but I still contend the latter is easier on the eyes, especially as the REs get very complex. Well, as easy on the eyes as any Perl code could be :-)