Seeking comparison table for different regexes - regex

I use vim, sed, bash and Perl. Each has a somewhat different regex syntax. I just spent time discovering that I need to escape curly braces in sed, but not in bash (when using them as repetition counts). Grrr.
Can anybody point me to a table that summarizes the differences between the regex parsers in these 4 environments?
TIA

http://www.regular-expressions.info/refflavors.html - scroll down a bit.
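Bash uses POSIX extended regexes (in [[ =~ ]]). Sed uses what is listed there as "GNU BRE" by default, although this depends on what flags you pass (-E switches to ERE). Vim has its own regex syntax, inherited from vi and ed, which is closest to BRE in its default mode.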

Jan Goyvaerts's site regular-expressions.info has a listing of popular regex engines and which options they support.

Related

How to comprehend expression/pattern in find/grep/rsync?

I have to use the find, grep and rsync commands in my program. I had rarely used all of them in a single script, so I didn't notice this earlier. Is there a category of regular expression that each of these commands follows, like:
find command: follows regex type1
grep command: follows regex type2
rsync command: follows regex type3
For example, for finding all the paths which lead to my program log file, we can do:
find -type f -name "foo.log*"
In the above command, the star is not acting as it would in a proper regular expression, where it means zero or more instances of the immediately preceding expression, which is the character 'g' in this case. If the pattern were actually a regex, it would match filenames like:
foo.lo
foo.log
foo.logg
foo.loggg
and so on...
rsync behaves like find when given a pattern for its source and destination paths. grep, on the other hand, does follow regular expression syntax.
So, in total:
Do all of these commands follow a different kind of regular expression?
Or do some of them follow regex while others do not, and if not, what pattern syntax do they follow? Basically, I'm looking for a general overview of the pattern syntaxes of all these tools.
I'm new to Linux tools. Please guide!
There is a big difference between wildcards and regular expressions.
Wildcards:
special characters that define a simple search pattern
used by shells (bash, old MS-DOS, ...), and by many unix commands (find, ...)
limited set of wildcards, typically just:
* - zero or more chars (any combination)
? - exactly one char (any char)
[...] - exactly one char out of a set or range of chars, such as [0-9a-f] for a hex digit
see tutorial: https://linuxhint.com/bash_wildcard_tutorial/
Regular Expression:
a sequence of characters that define a search pattern
think of regular expressions (regex for short) as wildcards on steroids
regex patterns are used to find or find and replace strings
powerful language, natively supported by most programming languages
there are different flavors of regular expressions, typically grouped into these categories:
POSIX Basic (BRE - Basic Regular Expressions)
POSIX Extended (ERE - Extended Regular Expressions)
Perl and PCRE (Perl Compatible Regular Expressions)
JavaScript
many more flavors, see https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines
some unix commands allow you to select one regex flavor or another; for example:
grep uses POSIX Basic by default
grep -E or egrep uses POSIX Extended
grep -P uses Perl (PCRE)
Wikipedia article: https://en.wikipedia.org/wiki/Regular_expression
tutorial: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
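The difference between the two pattern languages can be sketched in Python, whose standard library happens to ship both a shell-style wildcard matcher (fnmatch) and a Perl-style regex engine (re). This is only an illustration: the file names are made up, and find, grep and rsync each use their own matchers rather than Python's.

```python
import fnmatch
import re

# Hypothetical file names, chosen only to show the difference.
names = ["foo.lo", "foo.log", "foo.logg", "foo.log.1", "foo.log-old"]

# Read as a shell wildcard (what find -name does): '*' means "any run of characters".
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "foo.log*")]

# Read as a regex: '.' is "any one character" and 'g*' is "zero or more g".
regex_hits = [n for n in names if re.fullmatch(r"foo.log*", n)]

# A regex that expresses what the wildcard means: literal dot, then anything.
equivalent_hits = [n for n in names if re.fullmatch(r"foo\.log.*", n)]

print(glob_hits)        # ['foo.log', 'foo.logg', 'foo.log.1', 'foo.log-old']
print(regex_hits)       # ['foo.lo', 'foo.log', 'foo.logg']
print(equivalent_hits)  # same list as glob_hits
```

The same characters select different files depending on which pattern language the tool applies, which is why a pattern that works with find -name can behave unexpectedly when passed to grep.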

What flavor of regex does git use

I'm trying to use the git diff --word-diff-regex= command and it seems to reject any types of lookaheads and lookbehinds. I'm having trouble pinning down what flavor of regex git uses. For example
git diff --word-diff-regex='([.\w]+)(?!>)'
Comes back as an invalid regular expression.
I am trying to get all the words that are not HTML tags. So the resulting matches of the regex should be 'Hello' 'World' 'Foo' 'Bar' for the below string
<p> Hello World </p><p> Foo Bar </p>
The Git source uses regcomp and regexec, which are defined by POSIX 1003.2. The code to compile a diff regexp is:
if (regcomp(ecbdata->diff_words->word_regex,
o->word_regex,
REG_EXTENDED | REG_NEWLINE))
which in POSIX means that these are "extended" regular expressions as defined here.
(Not every C library actually implements the same POSIX REG_EXTENDED. Git includes its own implementation, which can be built in place of the system's.)
Edit (per updated question): POSIX EREs have neither lookahead nor lookbehind, nor do they have \w (but [_[:alnum:]] is probably close enough for most purposes).
Thanks to the hints in @torek's answer above, I now realize that there are different flavors of regular expression engines and that they can even have different syntax.
Even a single program, such as git, can be compiled with a different regex engine. For example, this blog post hints that \w would be supported by git, contradicting what I observed on my machine and what the OP here reported.
I ended up finding this section of the recommended Wikipedia page most helpful, since it presents the different syntaxes in one table, so that I could do some "translation" between, for example, [:alnum:] and \w, [:digit:] and \d, [:space:] and \s, etc.
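That kind of "translation" can be sketched as a small helper. The function name and the mapping table below are hypothetical, and the approach is deliberately naive: it only rewrites shorthand classes that appear outside a bracket expression (a pattern such as [.\w] would need [._[:alnum:]] instead), and it does nothing about lookaheads, which POSIX ERE simply lacks.

```python
# Hypothetical mapping from Perl-style shorthand to POSIX bracket expressions.
POSIX_EQUIVALENTS = {
    r"\w": "[_[:alnum:]]",
    r"\d": "[[:digit:]]",
    r"\s": "[[:space:]]",
}


def to_posix_classes(pattern: str) -> str:
    """Rewrite \\w, \\d and \\s in a pattern as POSIX bracket expressions.

    Naive sketch: assumes the shorthand is not inside [...] already.
    """
    out = []
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair in POSIX_EQUIVALENTS:
            out.append(POSIX_EQUIVALENTS[pair])
            i += 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            # Some other escape (e.g. \. or \\): copy both characters untouched.
            out.append(pair)
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


print(to_posix_classes(r"\w+"))      # [_[:alnum:]]+
print(to_posix_classes(r"\d{3}\s"))  # [[:digit:]]{3}[[:space:]]
```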

Why would a regex work in Sublime and not in vim?

Tried searching for regex found in this answer:
(,)(?=(?:[^']|'[^']*')*$)
I tried doing a search in Sublime and it worked out (around 700 results). When trying to replace the results it runs out of memory. Tried /(,)(?=(?:[^']|'[^']*')*$) in vim for searching first but it does not find any instances of the pattern. Also tried escaping all the ( and ) with \ in the regex.
Vim uses its own regular expression engine and syntax (which predates PCRE, by the way), so porting a regex from Perl or some other editor will most likely need some work.
The differences are too numerous to list in detail here, but :help pattern and :help perl-patterns will help.
Anyway, this quick and dirty rewrite of your regular expression seems to work on the sample given in the linked question:
/\v(,)(([^']|'[^']*')*$)@=
See :help \@= and :help \v.
One possible explanation is that the regular expression engine used in Sublime is different than the engine used in vim.
Not all regex engines are created equal; they don't all support the same features. (For example, a "negative lookahead" feature can be very powerful, but not all engines support it. And the syntax for some features differs between engines.)
A brief comparison of regular expression engines is available here:
http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
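As a sanity check that the original pattern is valid Perl-style syntax, here is a minimal sketch using Python's re module, which accepts the same (?=...) and (?:...) constructs. The sample string is made up for illustration; the pattern is the one from the question, unchanged.

```python
import re

# The pattern from the question: a comma followed, up to end of line,
# only by unquoted characters or complete '...' pairs.
COMMA_OUTSIDE_QUOTES = re.compile(r"(,)(?=(?:[^']|'[^']*')*$)")

# Hypothetical sample: the comma inside 'b,c' should be skipped.
sample = "a,'b,c',d"

positions = [m.start() for m in COMMA_OUTSIDE_QUOTES.finditer(sample)]
print(positions)  # [1, 7]
```

The lookahead rescans the rest of the line for every comma, so the work grows quickly on very long lines, which may be related to the memory problem seen when replacing in Sublime.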
Unfortunately Vim uses a different engine, and "normal" regular expressions won't work.
The regex you've mentioned isn't perfect: it doesn't skip escaped quotes, but, as I understand, it's good enough for you. Try this one, and if it doesn't match something, please send me that piece.
\v^([^']|'[^']*')*\zs,
A little explanation:
\v enables very magic search to avoid complex escaping rules
([^']|'[^']*') matches either a single character that is not a quote, or a pair of quotes together with everything between them
\zs indicates the beginning of the selection; you can think of it as a replacement for lookbehind.
In Vim's default (magic) mode you have to escape the |, otherwise it doesn't work. You should also escape the round brackets, unless you are searching for the literal '(' or ')' characters.
More information on regex usage in vim can be found on vimregex.com.

SED regular expressions trouble

I have built the following regular expression in order to fix a big SQL dump with invalid tags.
This searches
\[ame=(?:\\"){0,1}(?:http://){0,1}(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))[^\]]*\].+?video\]|\[video\](http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))\[/video\]
This replaces
[video=youtube;$2$4]$1$3[/video]
So this:
[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]
will become
[video=youtube;FD5ArmOMisM]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]
It works like a charm in EditPadPro (Windows), but I run into codepage conflicts when I try to import the result into my Linux-based MySQL.
Since the file comes from a Linux installation, I tried my luck with sed, but it gives me nothing but errors. Obviously it has a different way of building regular expressions.
The substitutions are quite urgent, so I have no time to read the sed manual.
Can you give me a hand migrating my regular expressions to a sed-friendly format?
Thanx in advance!
UPDATE: I added the escape chars proposed
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\))[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\))\[\/video\]
but I still get errors - Unknown command: ')'
Your regular expressions are using PCRE - Perl Compatible Regular Expression - notations. As defined by POSIX (codifying what was standardized by 7th Edition Unix circa 1978, which was a continuation of the previous versions of Unix), sed does not support PCRE.
Even GNU sed version 4.2.1, which supports ERE (extended regular expressions) as well as BRE (basic regular expressions) does not support PCRE.
Your best bet is probably to use Perl to provide you with the PCRE you need. Failing that, take the scripting language of your choice with PCRE support.
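As a sketch of the "scripting language of your choice" route, the search and replace from the question can be run unchanged through Python's re module. Python's engine is not PCRE itself, but it accepts every construct this pattern uses. The function name is hypothetical, and the only syntax change is that the replacement uses \g<N> where EditPadPro uses $N.

```python
import re

# Search pattern copied verbatim from the question.
AME_OR_VIDEO = re.compile(
    r'\[ame=(?:\\"){0,1}(?:http://){0,1}'
    r'(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))'
    r'[^\]]*\].+?video\]'
    r'|\[video\](http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))\[/video\]'
)

# Same replacement as [video=youtube;$2$4]$1$3[/video]; groups that did not
# take part in the match expand to an empty string (Python 3.5 and later).
REPLACEMENT = r"[video=youtube;\g<2>\g<4>]\g<1>\g<3>[/video]"


def fix_video_tags(text: str) -> str:
    """Hypothetical helper: rewrite [ame=...] and [video] tags in a dump."""
    return AME_OR_VIDEO.sub(REPLACEMENT, text)


sample = r'[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]'
print(fix_video_tags(sample))
# [video=youtube;FD5ArmOMisM]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]
```

For a large dump, reading and writing the file with an explicit encoding would also sidestep the codepage problem mentioned in the question, though the correct encoding for that file is not known here.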
Sed just has some different escaping rules to the Regex flavor you're using.
( ) are escaped as \( \) - for grouping
[ ] are not escaped - for character classes
{ } are escaped as \{ \} - for repetition counts
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\)\)[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\)\)\[\/video\]
I noticed a couple of unescaped )'s on enclosing groups.

Multi-line regex support in Vim

I notice the standard regex syntax for matching across multiple lines is to use /s, like so:
This is\nsome text
/This.*text/s
This works in Perl for instance but doesn't seem to be supported in Vim. Instead, I have to be much more specific:
/This[^\r\n]*[\r\n]*text/
I can't find any reason for why this should be, so I'm thinking I probably just missed the relevant bits in the vim help.
Can anyone confirm this behaviour one way or the other?
Yes, Perl's //s modifier isn't available on Vim regexes. See :h perl-patterns for details and a list of other differences between Vim and Perl regexes.
Instead you can use \_., which means "match any single character including newline". It's a bit shorter than what you have. See :h /\_..
/This\_.*text/
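For comparison, the behaviour of the Perl /s modifier can be reproduced in Python, where the equivalent flag is re.DOTALL. This is only a sketch of what "dot matches newline" changes, using the sample text from the question; it says nothing about Vim's own engine.

```python
import re

text = "This is\nsome text"

# Without the flag, '.' stops at the newline, so the pattern cannot span lines.
default_match = re.search(r"This.*text", text)

# With re.DOTALL (Perl's /s), '.' also matches '\n', much like Vim's \_.
dotall_match = re.search(r"This.*text", text, re.DOTALL)

print(default_match)                  # None
print(dotall_match.group(0) == text)  # True
```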