What flavor of regex does git use - regex

I'm trying to use the git diff --word-diff-regex= command and it seems to reject any types of lookaheads and lookbehinds. I'm having trouble pinning down what flavor of regex git uses. For example
git diff --word-diff-regex='([.\w]+)(?!>)'
Comes back as an invalid regular expression.
I am trying to get all the words that are not HTML tags. So the resulting matches of the regex should be 'Hello' 'World' 'Foo' 'Bar' for the below string
<p> Hello World </p><p> Foo Bar </p>

The Git source uses regcomp and regexec, which are defined by POSIX 1003.2. The code to compile a diff regexp is:
if (regcomp(ecbdata->diff_words->word_regex,
o->word_regex,
REG_EXTENDED | REG_NEWLINE))
which in POSIX means that these are "extended" regular expressions as defined here.
(Not every C library actually implements the same POSIX REG_EXTENDED. Git includes its own implementation, which can be built in place of the system's.)
Edit (per updated question): POSIX EREs have neither lookahead nor lookbehind, nor do they have \w (but [_[:alnum:]] is probably close enough for most purposes).
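As a rough sketch of a lookahead-free alternative: since POSIX EREs cannot say "not followed by >", one option is to let the regex match either a whole tag or a run of non-tag, non-space characters, so that each tag is consumed as a single token. The pattern and variable names below are my own illustration, not something taken from the Git source, and grep -o (a GNU/BSD extension) is used only to show what the pattern matches on the sample string.

```shell
# Sample line from the question.
line='<p> Hello World </p><p> Foo Bar </p>'

# ERE with two alternatives: a whole tag, or a run of characters that are
# neither angle brackets nor whitespace. No lookahead is required.
word_regex='<[^>]*>|[^<>[:space:]]+'

# grep -o prints every match on its own line.
tokens=$(printf '%s\n' "$line" | grep -oE "$word_regex")

# Dropping the tokens that start with "<" leaves only the words.
words=$(printf '%s\n' "$tokens" | grep -v '^<')
printf '%s\n' "$words"

# The same pattern could then be tried with git (untested assumption):
#   git diff --word-diff-regex='<[^>]*>|[^<>[:space:]]+'
```

With git the filtering step is not needed: the regex only defines what counts as a word, so a tag that is one token is simply diffed as a unit.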

Thanks to the hints in @torek's answer above, I now realize that there are different flavors of regular expression engines and that they can even have different syntax.
Even one particular program, such as git, could be compiled with a different regex engine. For example, this blog post hints that \w would be supported by git, contradicting what I observed on my machine and what the OP here asked.
I ended up finding this section from your recommended Wikipedia page most helpful, because it presents the different syntaxes in one table, so that I could do some "translation" between, for example, [:alnum:] and \w, [:digit:] and \d, [:space:] and \s, etc.
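To make that "translation" concrete, here is a small sketch using grep -E and sed with POSIX bracket classes only. The sample text and variable names are made up for illustration, and grep -o is a GNU/BSD extension rather than POSIX.

```shell
text='foo_bar 42 baz'

# \w+  ->  [_[:alnum:]]+   (\w also covers the underscore)
words=$(printf '%s\n' "$text" | grep -oE '[_[:alnum:]]+')

# \d+  ->  [[:digit:]]+
digits=$(printf '%s\n' "$text" | grep -oE '[[:digit:]]+')

# \s+  ->  [[:space:]][[:space:]]*   (BRE spelling of "one or more")
squeezed=$(printf '%s\n' "$text" | sed 's/[[:space:]][[:space:]]*/-/g')

printf '%s\n' "$words" "$digits" "$squeezed"
```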

Related

What Raku regex modifier makes a dot match a newline (like Perl's /s)?

How do I make the dot (.) metacharacter match a newline in a Raku regex? In Perl, I would use the dot matches newline modifier (/s)?
TL;DR The Raku equivalent for "Perl dot matches newline" is ., and for \Q...\E it's '...'.
There are ways to get better answers (more authoritative, comprehensive, etc than SO ones) to most questions like these more easily (typically just typing the search term of interest) and quickly (typically seconds, couple minutes tops). I address that in this answer.
What is Raku equivalent for "Perl dot matches newline"?
Just .
If you run the following Raku program:
/./s
you'll see the following error message:
Unsupported use of /s. In Raku please use: . or \N.
If you type . in the doc site's search box it lists several entries. One of them is . (regex). Clicking it provides examples and says:
An unescaped dot . in a regex matches any single character. ...
Notably . also matches a logical newline \n
My guess is you either didn't look for answers before asking here on SO (which is fair enough -- I'm not saying don't; that said you can often easily get good answers nearly instantly if you look in the right places, which I'll cover in this answer) or weren't satisfied by the answers you got (in which case, again, read on).
In case I've merely repeated what you've already read, or it's not enough info, I will provide a better answer below, after I write up an initial attempt to give a similar answer for your \Q...\E question -- and fail when I try the doc step.
What is Raku equivalent for Perl \Q...\E?
'...', or $foo if the ... was metasyntax for a variable name.
If you run the following Raku program:
/\Qfoo\E/
you'll see the following error message:
Unsupported use of \Q as quotemeta. In Raku please use: quotes or
literal variable match.
If you type \Q...\E in the doc site's search box it lists just one entry: Not in Index (try site search). If you go ahead and try the search as suggested, you'll get matching pages according to google. For me the third page/match listed (Perl to Raku guide - in a nutshell: "using String::ShellQuote (because \Q…\E is not completely right) ...") is the only true positive match of \Q...\E among 27 matches. And it's obviously not what you're interested in.
So, searching the doc for \Q...\E appears to be a total bust.
How does one get answers to a question like "what is the Raku equivalent of Perl's \Q...\E?" if the doc site ain't helpful (and one doesn't realize Rakudo happens to have a built in error message dedicated to the exact thing of interest and/or isn't sure what the error message means)? What about questions where neither Rakudo nor the doc site are illuminating?
SO is one option, but what lets folk interested in Raku frequently get good/great answers to their questions easily and quickly when they can't get them from the doc site because the answer is hard to find or simply doesn't exist in the docs?
Easily get better answers more quickly than asking SO Qs
The docs website doesn't always yield a good answer to simple questions. Sometimes, as we clearly see with the \Q...\E case, it doesn't yield any answer at all for the relevant search term.
Fortunately there are several other easily searchable sources of rich and highly relevant info that often work when the doc site does not. This is especially likely if you've got precise search terms in mind, such as /s or \Q...\E, and/or are willing to browse info provided it's high signal / low noise. I'll introduce two of these resources in the remainder of this answer.
Archived "spec" docs
Raku's design was written up in a series of "spec" docs written principally by Larry Wall over a 2 decade period.
(The word "specs" is short for "specification speculations". They are both ultra-authoritative, detailed, and precise specifications of the Raku language, authored primarily by Larry Wall himself, and mere speculations -- because it was all subject to implementation. The two aspects are left entangled, and the docs are now out of date. So don't rely on them 100% -- but don't ignore them either.)
The "specs", aka design docs, are a fantastic resource. You can search them using google by entering your search terms in the search box at design.raku.org.
A search for /s lists 25 pages. The only useful match is Synopsis 5: Regexes and Rules ("24 Jun 2002 — There are no /s or /m modifiers (changes to the meta-characters replace them - see below)." Click it. Then do an in-page search for /s (note the space). You'll see 3 matches:
There are no /s or /m modifiers (changes to the meta-characters replace them - see below)
A dot . now matches any character including newline. (The /s modifier is gone.)
. matches an anything, while \N matches an anything except what \n matches. (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.
A search for \Q...\E lists 7 pages. The only useful match is again Synopsis 5: Regexes and Rules ("24 Jun 2002 — \Q$var\E / ..."). Click it. Then do an in-page search for \Q. You'll see 2 matches:
In Raku / $var / is like a Perl / \Q$var\E /
\Q...\E sequences are gone.
Chat logs
I've expanded the Quicker answers section of my answer to one of your earlier Qs to discuss searching the Raku "chat logs". They are an incredibly rich mine of info with outstanding search features. Please read that section of my prior answer for clear general guidance. The rest of this answer will illustrate for /s and \Q...\E.
A search for the regex / newline . ** ^200 '/s' / in the old Raku channel from 2010 thru 2015 found this match:
. matches an anything, while \N matches an anything except what \n matches. (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.
Note the shrewdness of my regex. The pattern is the word "newline" (which is hopefully not too common) followed within 200 characters by the two character sequence /s (which I suspect is more common than newline). And I constrained to 2010-2014 because a search for that regex of the entire 15 years of the old Raku channel would tax Liz's server and time out. I got that hit I've quoted above within a couple minutes of trying to find some suitable match of /s (not end-of-sarcasm!).
A search for \Q in the old Raku channel was an immediate success. Within 30 seconds of the thought "I could search the logs" I had a bunch of useful matches.

Powershell compatible regex flavors

I am testing in Powershell, some regex that I get from a program written in another language.
But the regexes aren't working properly. I know that a regex is interpreted differently depending on the flavor (PCRE, POSIX, and so on).
My question is what are the compatible regex Flavors for powershell?
The correct flavor is .NET regex, but in online testers like Debuggex that don't offer it, I use PCRE and it seems to work fairly well.
Other issues you may run into are whether or not you need to escape certain characters in the string for powershell (irrespective of characters that need to be escaped for the regex engine).

Why would a regex work in Sublime and not in vim?

Tried searching for regex found in this answer:
(,)(?=(?:[^']|'[^']*')*$)
I tried doing a search in Sublime and it worked out (around 700 results). When trying to replace the results it runs out of memory. Tried /(,)(?=(?:[^']|'[^']*')*$) in vim for searching first but it does not find any instances of the pattern. Also tried escaping all the ( and ) with \ in the regex.
Vim uses its own regular expression engine and syntax (which predates PCRE, by the way), so porting a regex from Perl or some other editor will most likely need some work.
The many differences are too numerous to list in detail here but :help pattern and :help perl-patterns will help.
Anyway, this quick and dirty rewrite of your regular expression seems to work on the sample given in the linked question:
/\v(,)(([^']|'[^']*')*$)@=
See :help /\@= and :help /\v.
One possible explanation is that the regular expression engine used in Sublime is different than the engine used in vim.
Not all regex engines are created equal; they don't all support the same features. (For example, a "negative lookahead" feature can be very powerful, but not all engines support it. And the syntax for some features differs between engines.)
A brief comparison of regular expression engines is available here:
http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
Unfortunately Vim uses a different engine, and "normal" regular expressions won't work.
The regex you've mentioned isn't perfect: it doesn't skip escaped quotes, but, as I understand, it's good enough for you. Try this one, and if it doesn't match something, please send me that piece.
\v^([^']|'[^']*')*\zs,
A little explanation:
\v enables very magic search to avoid complex escaping rules
([^']|'[^']*') matches either a single character that is not a quote, or a pair of quotes together with everything between them
\zs indicates the beginning of selection; you can think of it as of a replacement for lookbehind.
You have to escape the |, otherwise it doesn't work under vim. You should also escape the round brackets, unless you are searching for the '(' or ')' characters.
More information on regex usage in vim can be found on vimregex.com.

SED regular expressions trouble

I have built the following regular expression in order to fix a big SQL dump with invalid tags
This searches
\[ame=(?:\\"){0,1}(?:http://){0,1}(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))[^\]]*\].+?video\]|\[video\](http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))\[/video\]
This replaces
[video=youtube;$2$4]$1$3[/video]
So this:
[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]
will become
[video=youtube;FD5ArmOMisM]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]
It behaves like a charm in EditPadPro (Windows) but it gives me conflicts with the codepages when I try to import it in my Linux based MySQL.
So since the file comes from a Linux installation I tried my luck with SED but it gives me errors errors errors. Obviously it has a different way to build regular expressions.
It is quite urgent to do the substitutions so I have no time reading the SED manual.
Can you give a hand to migrate my regular expressions to a SED friendly format?
Thanx in advance!
UPDATE: I added the escape chars proposed
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\))[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\))\[\/video\]
but I still get errors - unknown command: ')'
Your regular expressions are using PCRE - Perl Compatible Regular Expression - notations. As defined by POSIX (codifying what was standardized by 7th Edition Unix circa 1978, which was a continuation of the previous versions of Unix), sed does not support PCRE.
Even GNU sed version 4.2.1, which supports ERE (extended regular expressions) as well as BRE (basic regular expressions) does not support PCRE.
Your best bet is probably to use Perl to provide you with the PCRE you need. Failing that, take the scripting language of your choice with PCRE support.
Sed just has some different escaping rules to the Regex flavor you're using.
() escaped \( \) - for grouping
[] are not - for character classes
{} escaped \{ \} - for interval quantifiers such as \{0,1\}
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\)\)[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\)\)\[\/video\]
I noticed a couple of unescaped )'s on enclosing groups.

"diff" tool's flavor of regex seems lacking?

I have two files I've been trying to compare with diff. The files are automatically generated and feature a number of lines that look like:
//! Generated Date : Mon, 14, Dec 2009
I'd like those differences to be ignored, and have set out to use the "-I REGEX" flag to make that happen.
However, the number of spaces that appear between "Date" and the colon varies and unfortunately, it seems the flavor of regular expressions employed by diff lacks a number of the basic regex utilities.
For instance, I cannot for the life of me get the "one or more" plus-sign to work. Same deal with the "\s" representation of whitespace.
diff -I '.*Generated Date\s+:.*' ....
and
diff -I '.*Generated Date +:.*' ....
both fail spectacularly.
Rather than continuing to blindly try things, can somebody out there point me to a good reference on the diff-specific subset of regular expressions?
Thanks!
===== EDIT =======
Thanks to FalseVinylShrub, I've established that I should be escaping my '+' and any similar characters. This fixes the problem somewhat. Diff successfully matches
.*Generated Date \+.*
and
.*Generated Date *.*
(Note that there are two spaces between "Date" and "*".)
However, the second I try to add the ':' to that expression, like so:
.*Generated Date \+:.*
and
.*Generated Date \+\:.*
Both versions fail to match the string in question and cause diff to take a significantly greater amount of time to run. Any thoughts there?
Very interesting... I couldn't find a documentation reference, but a little experimentation found that:
␠* and .* worked if zero-or-more is OK for you
As you said, ␠+ doesn't work. Neither did ␠{1,}... but ␠\{1,\} did work
UPDATE: ␠\+ also works!
(␠ represents a space character, which wouldn't otherwise show up.)
I'm using GNU diff from GNU diffutils 2.8.1.
man diff and info diff didn't explain the RE syntax.
Hope this helps.
UPDATE: I found a brief section in man grep:
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and )
lose their special meaning; instead use the backslashed versions \?,
\+, \{, \|, \(, and \).
So I guess it's using Basic regex syntax.
Ok, here's what the GNU diff source says.
re_set_syntax (RE_SYNTAX_GREP | RE_NO_POSIX_BACKTRACKING);
I think that means, "same as gnu grep -G" (Basic Regular Expression). According to the gnu grep man page:
In basic regular expressions the meta-characters ?, +, {, |, (, and )
lose their special meaning; instead use the backslashed versions \?,
\+, \{, \|, \(, and \).
Forget about \s, \S, etc.
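Putting that together, here is a small sketch of an -I pattern written in plain BRE, assuming a diff that supports -I (GNU diff does). The file contents and variable names are invented for the demonstration. "One or more whitespace characters" is spelled as a bracket class followed by the same class with *, which avoids relying on \+ or \s.

```shell
# Two throwaway files that differ only in the generated-date line.
tmp=$(mktemp -d)
printf '%s\n' '//! Generated Date : Mon, 14, Dec 2009' 'int x;' > "$tmp/a"
printf '%s\n' '//! Generated Date    : Tue, 15, Dec 2009' 'int x;' > "$tmp/b"

# Plain diff reports the date line ("|| true" because diff exits 1 on differences).
plain=$(diff "$tmp/a" "$tmp/b" || true)

# With -I, a hunk is skipped when every changed line in it matches the BRE.
ignored=$(diff -I 'Generated Date[[:space:]][[:space:]]*:' "$tmp/a" "$tmp/b" || true)

if [ -n "$plain" ]; then echo "without -I: differences reported"; fi
if [ -z "$ignored" ]; then echo "with -I: nothing reported"; fi

rm -rf "$tmp"
```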
According to the specification, diff doesn't support regular expressions, nor does it have an -I switch.
You appear to be using a non-standard diff with non-standard extensions. How those non-standard extensions work, should be described in the documentation of whatever non-standard diff you are using.