Is plus (+) part of basic regular expressions?

Recently I was told that + (one or more occurrences of the preceding pattern/character) is not part of basic regex, not even when written as \+.
It came up in a question about maximum compatibility.
I was under the impression that ...
echo "Hello World, I am an example-text" | sed 's#[^a-z0-9]\+#.#ig'
... always results in:
Hello.World.I.am.an.example.text
But then I was told that "it replaces every character not lowercase or a digit followed by +" and that it is the same as [^a-z0-9][+].
So my real question: is there any regex definition or implementation that does not treat either x+ or x\+ the same as xx*?

POSIX "basic" regular expressions do not support + (nor ?!). Most implementations of sed add support for \+ but it's not a POSIX standard feature. If your goal is maximum portability you should avoid using it. Notice that you have to use \+ rather than the more common +.
echo "Hello World, I am an example-text" | sed 's#[^a-z0-9]\+#.#ig'
The -E flag enables "extended" regular expressions, which are a lot closer to the syntax used in Perl, JavaScript, and most other modern regex engines. With -E you don't need to have a backslash; it's simply +.
echo "Hello World, I am an example-text" | sed -E 's#[^a-z0-9]+#.#ig'
From https://www.regular-expressions.info/posix.html:
POSIX or "Portable Operating System Interface for uniX" is a collection of standards that define some of the functionality that a (UNIX) operating system should support. One of these standards defines two flavors of regular expressions. Commands involving regular expressions, such as grep and egrep, implement these flavors on POSIX-compliant UNIX systems. Several database systems also use POSIX regular expressions.
The Basic Regular Expressions or BRE flavor standardizes a flavor similar to the one used by the traditional UNIX grep command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its flavor. Most other flavors, including POSIX ERE, use a backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Shorthands are not supported. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Some implementations support \? and \+ as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not part of the POSIX standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there's no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
POSIX BRE does not support any other features. Even alternation is not supported.
(Emphasis mine.)
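The brace behavior described in the quote is easy to check (GNU grep shown; how unescaped braces behave can vary between BRE implementations):
echo 'a{1,2}' | grep 'a{1,2}'     # braces are literal in BRE, matches the text a{1,2}
echo 'aa' | grep 'a\{1,2\}'       # \{1,2\} is the interval operator, matches a or aa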
So my real question: is there any regex definition or implementation that does not treat either x+ or x\+ the same as xx*?
I can't think of any real-world language or tool that supports neither + nor \+.
In the formal mathematical definition of regular expressions there are commonly only three operations defined:
Concatenation: AB matches A followed by B.
Alternation: A|B matches either A or B.
Kleene star: R* matches 0 or more repetitions of R.
These three operations are enough to give the full expressive power of regular expressions†. Operators like ? and + are convenient in programming but not necessary in a mathematical context. If needed, they are defined in terms of the others: R? is R|ε and R+ is RR*.
† Mathematically speaking, that is. Features like back references and lookahead/lookbehind don't exist in formal language theory. Those features add additional expressive power not available in mathematical definitions of regular expressions.
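In practice, the R+ = RR* identity gives a strictly POSIX BRE rewrite of the sed command from the question (a sketch; again spelling out both letter cases instead of the non-standard i flag):
echo "Hello World, I am an example-text" | sed 's#[^a-zA-Z0-9][^a-zA-Z0-9]*#.#g'
Hello.World.I.am.an.example.text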

In some traditional sed implementations, you have to enable "extended" regular expressions to get support for + to mean "one or more."
For evidence of this, see: sed plus sign doesn't work

Related

Why is POSIX collating-related bracketed symbol higher-precedence than backslash?

POSIX, aka "The Open Group Base Specifications Issue 7, 2018 edition", has this to say about regular expression operator precedence:
9.4.8 ERE Precedence
The order of precedence shall be as shown in the following table:
ERE Precedence (from high to low):
Collation-related bracket symbols    [==] [::] [..]
Escaped characters                   \special-character
Bracket expression                   []
Grouping                             ()
Single-character-ERE duplication     * + ? {m,n}
Concatenation                        ab
Anchoring                            ^ $
Alternation                          |
I am curious as to the reason for the first two levels being in that order. Being a unix user from way back, I am accustomed to being able to "throw a backslash in front of it" to escape virtually anything. But it appears that with Collation-Related-Bracket-Symbols (CRBS), I can't do that. If I want to match a literal [.ch.] I can't just type \[.ch.] and rely on "dot matches dot" to handle things for me. I now have to match something like [[].ch.] (or possibly worse?).
I'm trying, and failing, to imagine what the scenario was when whoever-thought-this-up decided this should be the order. Is there a concrete scenario where having CRBS ranked higher than backslash makes sense, or was this a case of "we don't understand CRBS yet so let's make it higher priority" or ... what, exactly?
At least for GNU grep, it looks like lib/dfa.c treats the CRBS as one lexical token, as per the function parse_bracket_exp().
For the example given, escaping the special characters (square brackets and dots) seems to give the results you are looking for. You can also match literal dots with [.] which might be easier to see in a regular expression.
$ (echo c;echo '[.ch.]';echo .ch.;echo xchx)|grep '\[\.ch\.\]'
[.ch.]
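Combining those ideas, here is a sketch that avoids backslashes entirely: [[] is a bracket expression matching a single literal [, [.] matches a literal dot, and the final ] outside any bracket expression matches itself:
(echo c;echo '[.ch.]';echo .ch.;echo xchx)|grep '[[][.]ch[.]]'
[.ch.]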

Transform Regexp to POSIX BRE

I'd like to put this expression into POSIX BRE.
HTTP\/[\d.]+.\s+(?:403)\s+(4[0-9])\s+
Here is what I've come up with so far.
HTTP\/[0-9.]{1,}.[[:blank:]]{1,}403[[:blank:]]{1,}(4[0-9])[[:blank:]]
Using a web-based regex checker, both examples work quite well.
However, this regexp needs to be registered in SCOM, and it seems that SCOM only supports POSIX BRE for monitoring Linux servers.
Here's the POSIX documentation on Basic Regular Expressions. In particular, note:
When a BRE matching a single character, a subexpression, or a back-reference is followed by an interval expression of the format \{m\}, \{m,\}, or \{m,n\}, together with that interval expression it shall match what repeated consecutive occurrences of the BRE would match…
So [[:blank:]]{1,} isn't going to do what you think it will; the braces need to be preceded with backslashes.
On the other hand, most BRE implementations do allow you to use \+ to mean "one or more repetitions"; at least the BSD and GNU varieties do. So you might well be able to write that as [[:blank:]]\+ instead of using the numeric repetition operator [[:blank:]]\{1,\}.
Finally, [[:blank:]] might not be what you want. At least, it doesn't match the same thing as \s does. [[:blank:]] matches only space and tab characters ([ \t]). But in most regex libraries, \s is the same as [ \t\r\n\f\v], which is what is matched by [[:space:]] in a C regex (or by the isspace() function in C code). The most visible difference between [[:blank:]] and \s (or [[:space:]]) is that [[:blank:]] does not match newlines. Perhaps that's fine in your application.
Pedantic note: Some regex libraries define \s as [ \t\r\n\f], but you're unlikely to notice the difference. And all of those lists of characters assume that the regex has been compiled in the "C" locale. If the regex library is locale-aware and some other locale has been enabled, additional characters might match.
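Putting the pieces together, a sketch of the fully converted POSIX BRE: the braces and the capture group are backslash-escaped (BRE groups with \( and \), as quoted earlier), and the unnecessary backslash before / is dropped. Here it is tested with grep against a hypothetical log line:
echo 'HTTP/1.1  403  41 ' | grep 'HTTP/[0-9.]\{1,\}.[[:blank:]]\{1,\}403[[:blank:]]\{1,\}\(4[0-9]\)[[:blank:]]'
HTTP/1.1  403  41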

Sed escaping special chars

To make sed work with an alternation construct, we must escape special chars like ( or |:
sed -n "/\(abc\|def\)/p"
The simple
sed -n "/(abc|def)/p"
doesn't work.
My question is: why does sed behave contrary to "normal" regexes, where we escape special chars to give them a literal meaning?
What you call "normal" is a feature invented by Perl.
All traditional regex engines (e.g. the ones used by grep, sed, emacs, awk) have some special characters that match literally when escaped and normal characters that get a special meaning when escaped. My best guess for why this happened is evolution: Maybe the first implementation of regexes only supported [, ], and *, and everything else was matched literally. To introduce new features while keeping compatibility, the escaped syntax (\(, \), etc.) was invented.
Later on, other tools just copied the existing syntax.
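You can see the inversion directly with grep (GNU grep shown, since \+ in BRE is a GNU extension):
echo 'a+' | grep -o 'a+'       # BRE: + is literal, prints a+
echo 'aaa' | grep -o 'a\+'     # BRE: \+ means "one or more", prints aaa
echo 'aaa' | grep -oE 'a+'     # ERE: + means "one or more", prints aaa
echo 'a+' | grep -oE 'a\+'     # ERE: \+ is a literal plus, prints a+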
As far as I know, Perl was the first language to make regex syntax more, well, regular:
All alphanumeric characters match themselves.
Escaping an alphanumeric character may have a special meaning (e.g. \n, \1, \z).
Punctuation characters may have a special meaning (e.g. (, +, ?).
Escaping a non-alphanumeric character always makes it match literally, even if it wasn't special in the first place (e.g. \:, \").
All "modern" regex engines (e.g. the ones used in JavaScript or .NET) copied Perl's behavior.

Regular expression [:digit:] and [[:digit:]] difference

I am currently learning regular expressions. I ran into a problem that I can't find an answer to on Stack Overflow, and I hope someone can help me find it.
I am using vim on macOS.
If the file "regular_expression.txt" is:
"Open Source" is a good mechanism to develop programs.
You are the No. 1.
Then
grep -n '[:lower:]' regular_expression.txt
will return
1:"Open Source" is a good mechanism to develop programs.
2:You are the No. 1.
The command
grep '[[:lower:]]' regular_expression.txt
will return
2:You are the No. 1.
I can't understand the above difference, because it seems to me that [:lower:] is the set of lowercase characters, so [[:lower:]] should be the same as [:lower:]. It is also confusing that in the first case, where [:lower:] is used, all the lines in the file are returned.
POSIX character classes must be wrapped in bracket expressions.
The [:lower:] pattern is a positive bracket expression that matches a single char, :, l, o, w, e or r.
The [[:lower:]] pattern is a positive bracket expression that matches any char that is matched with the [:lower:] POSIX character class (that matches any lowercase letters).
See grep manual:
certain named classes of characters are predefined within bracket expressions... Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.
If you mistakenly omit the outer brackets, and search for say, [:upper:], GNU grep prints a diagnostic and exits with status 2, on the assumption that you did not intend to search for the nominally equivalent regular expression: [:epru].
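A quick way to see the difference is grep -o, which prints each match on its own line (GNU grep rejects the bare [:lower:] form as described above, so the equivalent explicit set [lower:] is used here as a sketch):
echo 'You are the No. 1.' | grep -o '[[:lower:]]'   # prints every lowercase letter
echo 'You are the No. 1.' | grep -o '[lower:]'      # prints o, r, e, e, o (only literal l, o, w, e, r or :)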

Convert Perl regular expression to equivalent ECMAScript regular expression

Now I'm using VC++ 2010, but the syntax_option_type of VC++ 2010 only contains the following options:
static const flag_type icase = regex_constants::icase;
static const flag_type nosubs = regex_constants::nosubs;
static const flag_type optimize = regex_constants::optimize;
static const flag_type collate = regex_constants::collate;
static const flag_type ECMAScript = regex_constants::ECMAScript;
static const flag_type basic = regex_constants::basic;
static const flag_type extended = regex_constants::extended;
static const flag_type awk = regex_constants::awk;
static const flag_type grep = regex_constants::grep;
static const flag_type egrep = regex_constants::egrep;
It doesn't contain perl_syntax_group (the Boost library has that option). However, I don't want to use the Boost library.
There are many regular expressions written in Perl, so I want to convert the existing Perl regular expressions to ECMAScript (or any flavor that VC++ 2010 supports). After conversion I can use the equivalent regular expressions directly in VC++ 2010 without using a third-party library.
One example:
const boost::tregex e(__T("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"));
const CString human_format = __T("$1-$2-$3-$4");
CString human_readable_card_number(const CString& s)
{
    return boost::regex_replace(s, e, human_format);
}
CString credit_card_number = "1234567887654321";
credit_card_number = human_readable_card_number(credit_card_number);
assert(credit_card_number == "1234-5678-8765-4321");
In the above example, what I want to do is convert e and human_format to ECMAScript-style expressions.
Is it possible to find a general way to convert all Perl regular expressions to ECMAScript style?
Are there some tools to do this?
Any help will be appreciated!
For the particular regex you want to convert, the equivalent in ECMA regex is:
/^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$/
In this case, \A (in Perl regex) has the same meaning as ^ (in ECMA regex), matching the beginning of the string, and \z (in Perl regex) has the same meaning as $ (in ECMA regex), matching the end of the string. Note that the meaning of ^ and $ in ECMA regex changes to matching the beginning and end of a line if you enable multiline mode.
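As a quick sanity check, here is the converted pattern run through an ECMAScript engine via a Node one-liner (assuming node is available; any ECMAScript host would do):
node -e 'console.log("1234567887654321".replace(/^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$/, "$1-$2-$3-$4"))'
1234-5678-8765-4321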
ECMA regex is roughly a subset of Perl regex, so if a regex uses Perl-exclusive features, it is likely not convertible to ECMA regex. Even when the syntax looks the same, it may mean slightly different things in the two dialects, so it is always wise to check the documentation and compare usage.
I'm only going to cover what is similar between ECMA regex and Perl regex. Where something is not the same but still convertible, I will mention it to the best of my ability.
ECMA regex is lacking in Unicode features, which compels you to look up code points and specify them in character classes yourself.
Going by the documentation for Perl regular expressions:
Modifiers:
Only i, g and m are in the ECMA standard, and they behave the same as in Perl.
The s (dot-all) modifier can be simulated in ECMA regex with two complementing character classes, e.g. [\S\s] or [\D\d] (see the sketch after this list).
There is no support at all for the x and p flags.
I don't know if there is any way to simulate the rest (prefix and suffix modifiers).
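A small sketch of the dot-all simulation mentioned above, again using node as an ECMAScript host: . does not cross the newline, while [\s\S] does:
node -e 'console.log(/a.b/.test("a\nb"), /a[\s\S]b/.test("a\nb"))'
false true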
Meta characters:
I have some doubt about using \ with a non-meta character that doesn't resolve to any special meaning, but it should be fine if you don't escape where you don't need to. . in ECMA regex excludes a few more characters than in Perl. The rest behaves the same in ECMA regex (even the effect of the m flag on ^ and $).
Quantifier:
Greedy and lazy behavior should be the same. There are no possessive quantifiers in ECMA regex.
Escape sequences:
There's no \a and \e in ECMA regex. \t, \n, \r, \f are the same.
Check the documentation if the regex has \cX - there are differences.
\xhh is common in ECMA regex and Perl regex (specifying 2 hexadecimal digits is the safest - otherwise, you will have to look up the documentation to see how the language will deal with the case where there are less than 2 hexadecimal digits).
\uhhhh is an ECMA-regex-exclusive feature for specifying a Unicode character. Perl has other, exclusive ways to specify characters, such as \x{}, \N{}, \o{} and \000.
\l, \u, \L, \U are exclusive to Perl regex.
\Q and \E can be simulated by escaping the quoted section by hand.
Octal escapes (with fewer than 3 octal digits) in Perl regex may be confusing. Check the context carefully, read the documentation, and/or test the regex to make sure you understand what it is doing, since such a sequence might be either an escape sequence or a back reference.
Character classes and other special escapes:
\w, \W, \s, \S, \d, \D are equivalent in ECMA regex and Perl regex, if assuming US-ASCII. If Unicode is involved, things will be a bloody mess.
There are no POSIX character classes in ECMA regex. Use \w, \s, \d from above, or spell the characters out yourself in a character class.
Back references are mostly the same, but I don't know whether back references beyond \9 are allowed in both Perl and ECMA regex.
Named references can be simulated with numbered back references (see the sketch after this list).
The rest (except [] and the already mentioned escape sequences) are unsupported in ECMA regex.
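For example, a Perl named group and back reference such as (?<word>\w+) \k<word> can be rewritten with a plain numbered group (a sketch, using node as the ECMAScript host):
node -e 'console.log(/(\w+) \1/.test("hello hello"))'
true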
Assertion:
\b and \B are equivalent in both languages, with regard to how they are defined based on \w.
Capture groups: Grouping with () and back references are the same. $n, which is used in the replacement string to refer to captured text, is the same. The rest of that section are Perl-exclusive features.
Quoting meta-characters: (Content already mentioned in previous sections).
Extended Pattern:
ECMA regex doesn't support modification of flags inside the regex. Depending on what the flags are, you may be able to rewrite the regex (the s flag is one that can always be converted to an equivalent expression in ECMA regex).
Only (?:pattern) (non-capturing group), (?=pattern) (positive lookahead) and (?!pattern) (negative lookahead) are common between Perl and ECMA.
There are no comments in ECMA regex, so (?#text) can simply be dropped.
Look-behinds are not supported in ECMA regex (Perl supports fixed-width look-behinds). In some cases, a Perl regex with a positive look-behind can be converted to ECMA regex by turning the look-behind into a capturing group (see the sketch after this list).
As mentioned before, named groups can be converted to normal capturing groups and referred to with numbered back references.
The rest are Perl-exclusive features.
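A sketch of the look-behind rewrite mentioned above: a hypothetical Perl pattern (?<=USD)\d+ can become USD(\d+) in ECMA regex, with the capturing group standing in for the text the look-behind version would have matched:
node -e 'const m = "USD100".match(/USD(\d+)/); console.log(m && m[1])'
100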
Special Backtracking Control Verbs: These are Perl-exclusive, and I have no idea what they do (never touched them before), let alone how to convert them. Most likely they are not convertible anyway.
Conclusion:
If a regex utilizes the full power of Perl regex, or features at the level the Boost library supports (e.g. recursive regexes), it is not possible to convert it to ECMA regex. Fortunately, ECMA regex covers the most commonly used features, so it is likely that your regexes are convertible.
Reference:
ECMA RegExp Reference on MDN