Regular expressions documentation while using grep

I am trying to find some comprehensive documentation on character classes in regular expressions that could be used while using grep. I tried
info grep
man grep
man 7 regex
but could not find all the character classes listed in the documentation.
I am looking for comprehensive documentation on the regex syntax that grep uses. Is such documentation available?

grep has three options for selecting the regex syntax: -E or --extended-regexp, -G or --basic-regexp, and -P or --perl-regexp.
Extended / Basic Regex Classes: Follow POSIX Classes
Perl Regex Classes: Follow Perl Classes
From the command line, POSIX regex information can be accessed via man 7 regex, whereas Perl regex documentation can be accessed via perldoc perlre.

http://linux.about.com/od/commands/l/blcmdl1_grep.htm
Finally, certain named classes of characters are predefined within
bracket expressions, as follows. Their names are self explanatory, and
they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:],
[:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
For example, [[:alnum:]] means [0-9A-Za-z], except the latter form
depends upon the C locale and the ASCII character encoding, whereas
the former is independent of locale and character set.
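To make the quoted list concrete, here is a small Python sketch (not part of grep) that spells out each named class as an explicit ASCII range, as it would apply in the C locale. The names POSIX_CLASSES and expand_posix_classes are made up for this illustration, and the expansions are an assumption that only holds for the C locale; other locales can include more characters.

```python
import re

# ASCII (C locale) equivalents of the POSIX named classes listed above.
# Assumption: these expansions are only valid for the C locale.
POSIX_CLASSES = {
    "alnum": "0-9A-Za-z",
    "alpha": "A-Za-z",
    "cntrl": r"\x00-\x1f\x7f",
    "digit": "0-9",
    "graph": r"\x21-\x7e",
    "lower": "a-z",
    "print": r"\x20-\x7e",
    "punct": r"!-/:-@\[-`{-~",
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}


def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] token with its explicit ASCII range.

    The token is expected to appear inside a bracket expression, so
    "[[:digit:]]" becomes "[0-9]". Unknown names raise KeyError.
    """
    return re.sub(
        r"\[:([a-z]+):\]",
        lambda m: POSIX_CLASSES[m.group(1)],
        pattern,
    )


if __name__ == "__main__":
    for name in POSIX_CLASSES:
        print(f"[[:{name}:]] ->", expand_posix_classes(f"[[:{name}:]]"))
```

The expanded patterns can be fed to Python's re module, which does not understand the [:name:] notation on its own.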

When I do man grep, this is what I get:
REGULAR EXPRESSIONS
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed
analogously to arithmetic expressions, by using various operators to combine smaller expressions.
grep understands two different versions of regular expression syntax: "basic" and "extended." In GNU grep,
there is no difference in available functionality using either syntax. In other implementations, basic
regular expressions are less powerful. The following description applies to extended regular expressions;
differences for basic regular expressions are summarized afterwards.
The fundamental building blocks are the regular expressions that match a single character. Most characters,
including all letters and digits, are regular expressions that match themselves. Any meta-character with
special meaning may be quoted by preceding it with a backslash.
The period . matches any single character.
Character Classes and Bracket Expressions
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that
list; if the first character of the list is the caret ^ then it matches any character not in the list. For
example, the regular expression [0123456789] matches any single digit.
Within a bracket expression, a range expression consists of two characters separated by a hyphen. It
matches any single character that sorts between the two characters, inclusive, using the locale's collating
sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many
locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to
[abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of
bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.
Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their
names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:],
[:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z],
except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is
independent of locale and character set. (Note that the brackets in these class names are part of the
symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most
meta-characters lose their special meaning inside bracket expressions. To include a literal ] place it
first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a
literal - place it last.
Anchoring
The caret ^ and the dollar sign $ are meta-characters that respectively match the empty string at the
beginning and end of a line.
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b
matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the
edge of a word. The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]].
Repetition
A regular expression may be followed by one of several repetition operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{,m} The preceding item is matched at most m times.
{n,m} The preceding item is matched at least n times, but not more than m times.
Concatenation
Two regular expressions may be concatenated; the resulting regular expression matches any string formed by
concatenating two substrings that respectively match the concatenated expressions.
Alternation
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any
string matching either alternate expression.
Precedence
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole
expression may be enclosed in parentheses to override these precedence rules and form a subexpression.
Back References and Subexpressions
The back-reference \n, where n is a single digit, matches the substring previously matched by the nth
parenthesized subexpression of the regular expression.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead
use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead,
so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start
of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character
string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an
extension, but portable scripts should avoid it.
Regular expressions are awesome!

Related

Regex to Glob and vice-versa conversion

We have a requirement to convert regexes to CloudFront-supported globs and vice versa. Any suggestions on how we can achieve that, and first of all, is it even possible? Especially from regex to glob: as I understand it, regex is a kind of superset, so it might not be possible to convert every regex to a corresponding glob.
To convert from a glob you would need to write a parser that splits the pattern into an abstract syntax tree. For example, the glob *-{[0-9],draft}.docx might parse to [Anything(), "-", OneOf([Range("0", "9"), "draft"]), ".docx"].
Then you would walk the AST and output the equivalent regular expression for each node. For example, the rules you might use for this could be:
Anything() -> .*
Range(x, y) -> [x-y]
OneOf(x, y) -> (x|y)
resulting in the regular expression .*-([0-9]|draft).docx.
That's not perfect, because you also have to remember to escape any special characters; . is a special character in regular expressions, so you should escape it, yielding finally .*-([0-9]|draft)\.docx.
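As a rough illustration of these rules, here is a minimal Python sketch that translates a glob in a single pass instead of building an explicit AST. The function name glob_to_regex is hypothetical, and the set of supported wildcards (*, ?, [...], [!...], {a,b}) is an assumption about common shell globs, not a statement about what CloudFront accepts. Unbalanced braces are not validated.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a shell-style glob into an unanchored regular expression."""
    out = []
    i, depth = 0, 0  # depth counts currently open {...} groups
    while i < len(glob):
        c = glob[i]
        if c == "*":
            out.append(".*")
        elif c == "?":
            out.append(".")
        elif c == "[":
            # Start the search at i + 2 so "]" may be the first member.
            j = glob.find("]", i + 2)
            if j == -1:
                out.append(re.escape(c))  # no closing bracket: literal "["
            else:
                body = glob[i + 1:j]
                if body.startswith("!"):
                    body = "^" + body[1:]  # glob negation -> regex negation
                out.append("[" + body.replace("\\", "\\\\") + "]")
                i = j  # skip past the bracket expression
        elif c == "{":
            out.append("(")
            depth += 1
        elif c == "}" and depth:
            out.append(")")
            depth -= 1
        elif c == "," and depth:
            out.append("|")  # comma separates alternatives inside braces
        else:
            out.append(re.escape(c))  # escape regex metacharacters such as "."
        i += 1
    return "".join(out)


if __name__ == "__main__":
    pattern = re.compile(glob_to_regex("*-{[0-9],draft}.docx"))
    for name in ["report-7.docx", "report-draft.docx", "report-final.docx"]:
        print(name, bool(pattern.fullmatch(name)))
```

For globs without braces, Python's standard library already offers fnmatch.translate, which handles *, ?, [seq] and [!seq].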
Strictly speaking, not every regular expression can be translated to a glob pattern. The Kleene star operation does not exist in globbing; the simple regular expression a* (i.e., any number of a characters) cannot be translated to a glob pattern.
I'm not sure what types of globs CloudFront supports (the documentation returned no hits for the term "glob"), but here is some documentation on commonly-supported shell glob pattern wildcards.
Here is a summary of some equivalent sequences:
Glob Wildcard   Regular Expression   Meaning
?               .                    Any single character
*               .*                   Zero or more characters
[a-z]           [a-z]                Any character from the range
[!a-m]          [^a-m]               A character not in the range
[a,b,c]         [abc]                One of the given characters
{cat,dog,bat}   (cat|dog|bat)        One of the given options
{*.tar,*.gz}    (.*\.tar|.*\.gz)     One of the given options, considering nested wildcards

Regular expression in Snowflake - starts with string and ends with digits

I am struggling with writing a regular expression in Snowflake.
SELECT
'DEM7BZB01-123' AS SKU,
RLIKE('DEM7BZB01-123','^DEM.*\d\d$') AS regex
I would like to find all strings that start with "DEM" and end with two digits. Unfortunately the expression that I am using returns FALSE.
I was checking this expression in two regex generators and it worked.
In Snowflake, the backslash character \ is an escape character in single-quoted strings.
Reference: Escape Characters and Caveats
So you need to use 2 backslashes in a regex to express 1.
SELECT
'DEM7BZB01-123' AS SKU,
RLIKE('DEM7BZB01-123', '^DEM.*\\d\\d$') AS regex
Or you could write the regex pattern in such a way that the backslash isn't used.
For example, the pattern ^DEM.*[0-9]{2}$ matches the same as the pattern ^DEM.*\d\d$.
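The claim that the two patterns match the same strings can be checked quickly outside Snowflake. The sketch below uses Python's re module as a stand-in, which is an assumption: it is not Snowflake's regex engine, and the sample strings other than the SKU from the question are invented. The re.ASCII flag restricts \d to 0-9 so that it lines up with [0-9].

```python
import re

# Raw strings keep the backslashes intact, so no doubling is needed here.
with_backslash = re.compile(r"^DEM.*\d\d$", re.ASCII)
without_backslash = re.compile(r"^DEM.*[0-9]{2}$")

# The first sample is the SKU from the question; the rest are made up.
samples = ["DEM7BZB01-123", "DEM-1", "XDEM12", "DEM12"]

results = {
    s: (bool(with_backslash.search(s)), bool(without_backslash.search(s)))
    for s in samples
}

if __name__ == "__main__":
    for sample, (a, b) in results.items():
        print(f"{sample:15} backslash form: {a}  bracket form: {b}")
```

Both patterns agree on every sample, which is why the bracket form is a convenient way to sidestep the escaping question entirely.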
You need to escape your backslashes in your SQL before it can be parsed as a regex string. (sometimes it gets a bit silly with the number of backslashes needed)
Your example should look like this
RLIKE('DEM7BZB01-123','^DEM.*\\d\\d$') AS regex
RLIKE (which is an alias in Snowflake for the SQL Standard REGEXP_LIKE function) implicitly adds ^ and $ to your search pattern...
The function implicitly anchors a pattern at both ends (i.e. '' automatically becomes '^$', and 'ABC' automatically becomes '^ABC$').
so you can remove them, and that then allows you to use $$ quoting
In single-quoted string constants, you must escape the backslash character in the backslash-sequence. For example, to specify \d, use \\d. For details, see Specifying Regular Expressions in Single-Quoted String Constants (in this topic).
You do not need to escape backslashes if you are delimiting the string with pairs of dollar signs ($$) (rather than single quotes).
so you can simply use the regex DEM.*\d\d to find all strings that start with DEM and end with two digits, without extra escaping, as follows:
SELECT
'DEM7BZB01-123' AS SKU
, RLIKE('DEM7BZB01-123', $$DEM.*\d\d$$) AS regex
which gives
SKU |REGEX|
-------------+-----+
DEM7BZB01-123|true |

Which characters combined with ^ don't need to be escaped in sed?

I have checked that ^* and ^& match lines beginning with * and &, which I didn't expect since they are special characters. But ^[ doesn't work. Is this "standard" behavior? Is there any rationale behind this?
sed version used was "GNU sed 4.4".
From POSIX.1-2017:
The sed utility shall support the BREs described in XBD Basic Regular Expressions, ... [sed]
Reading the POSIX section on BREs, we read:
A BRE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character is a BRE that matches the special character itself. The BRE special characters and the contexts in which they have their special meaning are as follows:
.[\:
The <period>, <left-square-bracket>, and <backslash> shall be special except when used in a bracket expression (see RE Bracket Expression). An expression containing a '[' that is unescaped and is not part of a bracket expression produces undefined results.
*:
The <asterisk> shall be special except when used:
In a bracket expression
As the first character of an entire BRE (after an initial '^', if any)
As the first character of a subexpression (after an initial '^', if any); see BREs Matching Multiple Characters
^:
The <circumflex> shall be special when used as an anchor (see BRE Expression Anchoring). The <circumflex> shall signify a non-matching list expression when it occurs first in a list, immediately following a <left-square-bracket> (see RE Bracket Expression).
$:
The <dollar-sign> shall be special when used as an anchor.
source: Basic Regular Expressions, Special characters
So to answer the OP's question using the above:
& is not a special character, so ^& is expected to work.
[ should always be escaped if it does not start a bracket expression.
* is not special after an initial ^ when the latter is an anchor.
So everything the OP observed is the expected behavior.
There is however still an interesting paragraph in RE Bracket Expression:
A bracket expression is either a matching list expression or a non-matching list expression. It consists of one or more expressions: ordinary characters, collating elements, collating symbols, equivalence classes, character classes, or range expressions. The <right-square-bracket> ( ] ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial <circumflex>( ^ ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as [.].] ) or is the ending <right-square-bracket> for a collating symbol, equivalence class, or character class. The special characters ., *, [, and \\ ( <period>, <asterisk>, <left-square-bracket>, and <backslash>, respectively) shall lose their special meaning within a bracket expression.
source: Basic Regular Expressions, RE Bracket Expression
This implies that ] cannot be escaped in a bracket expression. This means:
The following work:
$ echo '[]' | sed 's/[^]x]/a/'
a]
$ echo '[]' | sed 's/[^x[.].]]/a/'
a]
but this does not work as expected:
$ echo '[]' | sed 's/[^x\]]/a/'
[]
So in a bracket expression, don't escape it, but collate it!
See sed "3.3 Overview of Regular Expression Syntax" documentation.
The & char is not a special regex char, so it does not need escaping in a regex pattern. Note that & can be parsed as a special construct in the replacement pattern, where it refers to the whole match.
The * is not special when it is at the start in GNU sed (^* is a pattern that matches a * at the start of the string):
POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many nonGNU implementations do not support this and portable scripts should instead use \* in these contexts.
The [ starts a bracket expression and must have a paired ] to close the expression, hence it is an error.

special characters in sed

Does anybody know what the complete list of special characters in sed is?
Please don't give an answer like "it is the same list of special characters as for grep", because that just transforms my question into: does anybody know what the complete list of special characters in grep is?
It depends. Strictly speaking, a standard compliant sed must only use Basic Regular Expressions for which the standard states:
The BRE special characters and the contexts in which they have their special meaning are as follows:
.[\ The period, left-square-bracket, and backslash shall be special except when used in a bracket expression (see RE Bracket Expression ). An expression containing a '[' that is not preceded by a backslash and is not part of a bracket expression produces undefined results.
* The asterisk shall be special except when used in a bracket expression, as the first character of an entire BRE (after an initial '^' , if any), or as the first character of a subexpression (after an initial '^' , if any); see BREs Matching Multiple Characters
^ The circumflex shall be special when used as an anchor (see BRE Expression Anchoring) or as the first character of a bracket expression (see RE Bracket Expression)
$ The dollar-sign shall be special when used as an anchor.
So the complete list is .[\*^$, but context matters. Also, many sed implementations provide options to use extended regular expressions (EREs), which will expand the list and change the context in which characters are special. For example, without EREs, groupings are formed using \( and \), but with EREs, ( and ) by themselves are special and must be escaped to be matched literally.
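One practical use of this list is quoting arbitrary text so that sed treats it literally in a BRE. The Python sketch below is only an illustration: bre_escape is a hypothetical helper, it assumes the pattern will be used between / delimiters (so / is escaped too), and it does not handle newlines or ERE mode.

```python
# The BRE special characters named in the quoted standard text: . [ \ * ^ $
BRE_SPECIAL = set(".[\\*^$")


def bre_escape(text: str) -> str:
    """Backslash-escape BRE special characters and the "/" delimiter.

    Assumption: the result is placed in a sed command that uses "/" as
    its delimiter, for example s/PATTERN/replacement/.
    """
    return "".join(
        "\\" + ch if ch in BRE_SPECIAL or ch == "/" else ch
        for ch in text
    )


if __name__ == "__main__":
    for literal in ["a.b*c", "[x]$", "/usr/bin", r"C:\temp"]:
        print(literal, "->", bre_escape(literal))
```

Escaping a character even where context would make it ordinary (such as * at the start of a pattern) is harmless for these six characters, which keeps the helper simple.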
I think this is the full list of characters that sed treats differently from ordinary characters: [\^$.|?*+()

Using escape characters inside grep

I have the following regular expression for eliminating spaces, tabs, and new lines: [^ \n\t]
However, I want to expand this for certain additional characters, such as > and <.
I tried [^ \n\t<>], which works well for now, but I want the expression to not match if the < or > is preceded by a \.
I tried [^ \n\t[^\\]<[^\\]>], but this did not work.
Can any one of the sequences below occur in your input?
\\>
\\\>
\\\\>
\blank
\tab
\newline
...
If so, how do you propose to treat them?
If not, then zero-width look-behind assertions will do the trick, provided that your regular expression engine supports them. This will be the case in any engine that supports Perl-style regular expressions (Perl itself, PHP, etc.):
(?<!\\)[ \n\t<>]
The above will match any unescaped space, newline, tab, or angle bracket. More generically (using \s to denote any whitespace character, including \r):
(?<!\\)\s
Alternatively, using complementary notation without the need for a zero-width look-behind assertion (but arguably less efficiently):
(?:[^ \n\t<>]|\\[<>])
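You may also use a variation of the latter to handle the \\>, \\\>, \\\\> etc. cases as well, up to some finite number of preceding backslashes. An interval such as {1,3,5,7,9} is not valid syntax, so an odd count of one to nine backslashes is written as one backslash followed by up to four pairs:
(?:[^ \n\t<>]|(?:^|[^<>])\\(?:\\\\){0,4}[<>])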
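The look-behind approach can be tried out in Python, whose re module supports Perl-style look-behind. This is a sketch under that assumption; it says nothing about whether a particular grep build supports the same syntax. The names UNESCAPED and EVEN_AWARE and the sample strings are made up. The second pattern is one possible way to treat a character preceded by an even number of backslashes as unescaped, which is the case the questions above are probing.

```python
import re

# A space, newline, tab, "<" or ">" NOT directly preceded by a backslash.
UNESCAPED = re.compile(r"(?<!\\)[ \n\t<>]")

# Variant: allow an even run of backslashes (escaped backslashes) in front.
# Group 1 holds the delimiter character itself.
EVEN_AWARE = re.compile(r"(?<!\\)(?:\\\\)*([ \n\t<>])")

text = r"a\<b>c d"  # the "<" is escaped, the ">" and the space are not
hits = [(m.start(), m.group()) for m in UNESCAPED.finditer(text)]

# Two backslashes then "<": the simple look-behind treats "<" as escaped,
# while the even-aware variant reports it as a real delimiter.
simple_on_double = UNESCAPED.search(r"a\\<")
aware_on_double = EVEN_AWARE.search(r"a\\<")
aware_on_single = EVEN_AWARE.search(r"a\<")

if __name__ == "__main__":
    print(hits)
    print(simple_on_double, aware_on_double, aware_on_single)
```

The simple look-behind is enough when backslashes themselves are never escaped in the input; otherwise something like the second pattern is needed.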
According to the grep man page:
A bracket expression is a list of
characters enclosed by [ and ]. It
matches any single character in that
list; if the first character of the
list is the caret ^ then it matches
any character not in the list.
This means that a bracket expression can only match single characters; it cannot match a sequence of characters such as \< or \>.
If you have a version of grep built with Perl regex support (grep -P), you can use lookarounds, as one of the other posters mentioned. Not all versions of grep have this support, though.
Alternatively, you could use egrep and put your pattern string inside quotes, which should remove the need for shell-level escaping.