"diff" tool's flavor of regex seems lacking? - regex

I have two files I've been trying to compare with diff. The files are automatically generated and feature a number of lines that look like:
//! Generated Date : Mon, 14, Dec 2009
I'd like those differences to be ignored, and have set out to use the "-I REGEX" flag to make that happen.
However, the number of spaces that appear between "Date" and the colon varies and unfortunately, it seems the flavor of regular expressions employed by diff lacks a number of the basic regex utilities.
For instance, I cannot for the life of me get the "one or more" plus-sign to work. Same deal with the "\s" representation of whitespace.
diff -I '.*Generated Date\s+:.*' ....
and
diff -I '.*Generated Date +:.*' ....
both fail spectacularly.
Rather than continuing to blindly try things, can somebody out there point me to a good reference on the diff-specific subset of regular expressions?
Thanks!
===== EDIT =======
Thanks to FalseVinylShrub, I've established that I should be escaping my '+' and any similar characters. This fixes the problem somewhat. Diff successfully matches
.*Generated Date \+.*
and
.*Generated Date *.*
(Note that there are two spaces between "Date" and "*".)
However, the second I try to add the ':' to that expression, like so:
.*Generated Date \+:.*
and
.*Generated Date \+\:.*
Both versions fail to match the string in question and cause diff to take a significantly greater amount of time to run. Any thoughts there?

Very interesting... I couldn't find a documentation reference, but a little experimentation found that:
␠* and .* worked if zero-or-more is OK for you
As you said, ␠+ doesn't work. Neither did ␠{1,}... but ␠\{1,\} did work
UPDATE: ␠\+ also works!
(␠ is representing a space character, that didn't show up).
I'm using GNU diff from GNU diffutils 2.8.1.
man diff and info diff didn't explain the RE syntax.
Hope this helps.
UPDATE: I found a brief section in man grep:
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and )
lose their special meaning; instead use the backslashed versions \?,
\+, \{, \|, \(, and \).
So I guess it's using Basic regex syntax.

Ok, here's what the GNU diff source says.
re_set_syntax (RE_SYNTAX_GREP | RE_NO_POSIX_BACKTRACKING);
I think that means, "same as gnu grep -G" (Basic Regular Expression). According to the gnu grep man page:
In basic regular expressions the meta-characters ?, +, {, |, (, and )
lose their special meaning; instead use the backslashed versions
\?, \+, \{, \|, \(, and \).
Forget about \s, \S, etc.
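To make the escaping rule concrete, here is a rough sketch in Python. Python's re module is not the engine diff uses, and the helper name bre_to_python is made up for this illustration; it only swaps the backslashed and bare forms of the characters listed in the grep man page excerpt, so that a GNU-style basic regex can be tried out in a syntax where + and ( are operators by default. It ignores back-references, POSIX character classes, and other corner cases.

```python
import re

# Characters whose operator/literal roles are swapped between GNU-style
# basic regexes (grep -G) and Python's re syntax.
SWAPPED = set("?+{}|()")


def bre_to_python(pattern):
    """Illustrative, incomplete translation of a GNU-style BRE to re syntax."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \+ is the operator in a BRE, so emit a bare + for Python.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch == "[":
            # Copy a bracket expression through unchanged.
            j = i + 1
            if pattern[j:j + 1] == "^":
                j += 1
            if pattern[j:j + 1] == "]":
                j += 1
            end = pattern.find("]", j)
            if end == -1:
                end = len(pattern) - 1
            out.append(pattern[i:end + 1])
            i = end + 1
        elif ch in SWAPPED:
            # A bare + in a BRE is just a literal plus sign.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


line = "//! Generated Date   : Mon, 14, Dec 2009"
escaped = bre_to_python(r".*Generated Date \+:.*")  # backslashed plus
bare = bre_to_python(r".*Generated Date +:.*")      # bare plus

print(escaped, bool(re.search(escaped, line)))
print(bare, bool(re.search(bare, line)))
```

Under these assumptions the backslashed form matches the sample line, while the bare + only matches a literal plus sign, which is consistent with the behaviour described in the question.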

According to the POSIX specification, diff doesn't support regular expressions, nor does it have an -I switch.
You appear to be using a non-standard diff with non-standard extensions. How those non-standard extensions work, should be described in the documentation of whatever non-standard diff you are using.

Related

What flavor of regex does git use

I'm trying to use the git diff --word-diff-regex= command and it seems to reject any types of lookaheads and lookbehinds. I'm having trouble pinning down what flavor of regex git uses. For example
git diff --word-diff-regex='([.\w]+)(?!>)'
Comes back as an invalid regular expression.
I am trying to get all the words that are not HTML tags. So the resulting matches of the regex should be 'Hello' 'World' 'Foo' 'Bar' for the below string
<p> Hello World </p><p> Foo Bar </p>
The Git source uses regcomp and regexec, which are defined by POSIX 1003.2. The code to compile a diff regexp is:
if (regcomp(ecbdata->diff_words->word_regex,
o->word_regex,
REG_EXTENDED | REG_NEWLINE))
which in POSIX means that these are "extended" regular expressions as defined here.
(Not every C library actually implements the same POSIX REG_EXTENDED. Git includes its own implementation, which can be built in place of the system's.)
Edit (per updated question): POSIX EREs have neither lookahead nor lookbehind, nor do they have \w (but [_[:alnum:]] is probably close enough for most purposes).
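As an illustration of getting the same result without lookahead, here is a small Python sketch. It is not git's engine and has not been tested with --word-diff-regex; the function name words_outside_tags is invented for the example. The idea is to let one alternative consume whole tags and a second, captured alternative pick up the words, then keep only the captured matches. The same alternation shape should be expressible in an ERE by replacing \w with a bracket expression such as [_[:alnum:]], but that part is an assumption.

```python
import re

# First alternative swallows a whole tag; second one captures a word.
TOKEN = re.compile(r"<[^>]*>|([.\w]+)")


def words_outside_tags(html):
    """Return the words that are not part of an HTML tag."""
    return [m.group(1) for m in TOKEN.finditer(html) if m.group(1)]


words = words_outside_tags("<p> Hello World </p><p> Foo Bar </p>")
print(words)
```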
Thanks to the hints in @torek's answer above, I now realize that there are different flavors of regular expression engines, and that they can even have different syntax.
Even a single program, such as git, can be compiled with different regex engines. For example, this blog post hints that \w is supported by git, contradicting what I observed on my machine and what the OP asked here.
I ended up finding this section of the recommended Wikipedia page most helpful: it presents the different syntaxes in one table, so that I could "translate" between, for example, [:alnum:] and \w, [:digit:] and \d, [:space:] and \s, etc.

Bash "diff" utility showing files as different when using a regex Ignore

I'm trying to use the bash utility "diff" that is documented here: http://ss64.com/bash/diff.html. Note that I'm using a windows-ported version of the bash utility, but that shouldn't make any difference.
I have two files, regex_test_1.txt and regex_test_2.txt that have the following contents:
regex_test_1.txt:
// $Id: some random id string $ more text
text that matches
regex_test_2.txt:
// $Id: some more random id string $ more text
text that matches
I am trying to diff these files while ignoring any lines that fit this regex:
.*\$(Id|Header|Date|DateTime|Change|File|Revision|Author):.*\$.*
However, when I run diff and tell it to ignore lines matching this regex using the -I argument, this is the output:
C:\Users\myname\Documents>diff -q -r -I ".*\$(Id|Header|Date|DateTime|Change|File|Revision|Author):.*\$.*" regex_test_1.txt regex_test_2.txt
Files regex_test_1.txt and regex_test_2.txt differ
I expect that it should find no differences (and report nothing). Why is it finding these files to be different?
It's because diff uses basic regex syntax, wherein certain regex metacharacters lose their special meaning:
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
This should work:
.*\$\(Id\|Header\|Date\|DateTime\|Change\|File\|Revision\|Author\):.*\$.*
Just for giggles, add -b to your diff to ignore differences in white space.
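For comparison, here is a rough Python sketch of the intended effect. It is a simplification, not how diff works internally: GNU diff ignores a hunk only when every changed line in it matches the pattern, whereas this sketch simply drops matching lines from both inputs before comparing. Python's re syntax treats ( ) and | as operators without backslashes, so the pattern from the question can be used unchanged. The name significant_lines is made up for the example.

```python
import re

# The pattern from the question, unchanged.
KEYWORD_LINE = re.compile(
    r".*\$(Id|Header|Date|DateTime|Change|File|Revision|Author):.*\$.*"
)


def significant_lines(text):
    """Return the lines that do not look like keyword-expansion lines."""
    return [ln for ln in text.splitlines() if not KEYWORD_LINE.match(ln)]


file_1 = "// $Id: some random id string $ more text\ntext that matches\n"
file_2 = "// $Id: some more random id string $ more text\ntext that matches\n"

same = significant_lines(file_1) == significant_lines(file_2)
print("identical" if same else "differ")
```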

SED regular expressions trouble

I have built the following regular expression in order to fix a big SQL dump with invalid tags
This searches
\[ame=(?:\\"){0,1}(?:http://){0,1}(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))[^\]]*\].+?video\]|\[video\](http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))\[/video\]
This replaces
[video=youtube;$2$4]$1$3[/video]
So this:
[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]
will become
[video=youtube;FD5ArmOMisM]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]
It behaves like a charm in EditPadPro (Windows) but it gives me conflicts with the codepages when I try to import it in my Linux based MySQL.
So since the file comes from a Linux installation I tried my luck with SED but it gives me errors errors errors. Obviously it has a different way to build regular expressions.
It is quite urgent to do the substitutions so I have no time reading the SED manual.
Can you give a hand to migrate my regular expressions to a SED friendly format?
Thanx in advance!
UPDATE: I added the escape chars proposed
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\))[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\))\[\/video\]
but I still get errors - Unknown command: ')'
Your regular expressions are using PCRE - Perl Compatible Regular Expression - notations. As defined by POSIX (codifying what was standardized by 7th Edition Unix circa 1978, which was a continuation of the previous versions of Unix), sed does not support PCRE.
Even GNU sed version 4.2.1, which supports ERE (extended regular expressions) as well as BRE (basic regular expressions) does not support PCRE.
Your best bet is probably to use Perl to provide you with the PCRE you need. Failing that, take the scripting language of your choice with PCRE support.
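As one example of the "scripting language of your choice" route, here is a sketch using Python's re module, which accepts the non-capturing groups and lazy quantifiers in the original pattern. The pattern is copied from the question as-is; the function name convert_tags is invented for the example, and the replacement relies on Python 3.5+ treating unmatched groups as empty strings. Reading and writing the dump file with the right encoding is left out.

```python
import re

# Pattern from the question, unchanged.
PATTERN = re.compile(
    r'\[ame=(?:\\"){0,1}(?:http://){0,1}'
    r'(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))'
    r'[^\]]*\].+?video\]'
    r'|\[video\]'
    r'(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))'
    r'\[/video\]'
)

# Python writes group references as \1, \2, ... instead of $1, $2, ...
REPLACEMENT = r"[video=youtube;\2\4]\1\3[/video]"


def convert_tags(text):
    """Rewrite [ame=...] and bare [video] tags into [video=youtube;ID] form."""
    return PATTERN.sub(REPLACEMENT, text)


sample = (
    '[ame=\\"http://www.youtube.com/watch?v=FD5ArmOMisM\\"]'
    "YouTube - Official Install Of X360FDU![/video]"
)
converted = convert_tags(sample)
print(converted)
```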
Sed just has some different escaping rules to the Regex flavor you're using.
() are escaped as \( \) - for grouping
[] are not escaped - for character classes
{} are escaped as \{ \} - for quantifiers
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\)\)[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\)\)\[\/video\]
I noticed a couple of unescaped )'s on enclosing groups.

What's the difference between vim regex and normal regex?

I noticed that vim's substitute regex is a bit different from other regexp. What's the difference between them?
"Regular expression" really defines algorithms, not a syntax. What that means is that different flavours of regular expressions will use different characters to mean the same thing; or they'll prefix some special characters with backslashes where others don't. They'll typically still work in the same way.
Once upon a time, POSIX defined the Basic Regular Expression syntax (BRE), which Vim largely follows. Very soon afterwards, an Extended Regular Expression (ERE) syntax proposal was also released. The main difference between the two is that BRE tends to treat more characters as literals - an "a" is an "a", but also a "(" is a "(", not a special character - and so involves more backslashes to give them "special" meaning.
The discussion of complex differences between Vim and Perl on a separate comment here is useful, but it's also worth mentioning a few of the simpler ways in which Vim regexes differ from the "accepted" norm (by which you probably mean Perl.) As mentioned above, they mostly differ in their use of a preceding backslash.
Here are some obvious examples:
Perl      Vim        Explanation
----------------------------------------------------------------
x?        x\=        Match 0 or 1 of x
x+        x\+        Match 1 or more of x
(xyz)     \(xyz\)    Use parentheses to group matches
x{n,m}    x\{n,m}    Match n to m of x
x*?       x\{-}      Match 0 or more of x, non-greedy
x+?       x\{-1,}    Match 1 or more of x, non-greedy
\b        \< \>      Word boundaries
$n        \n         Backreferences for previously grouped matches
That gives you a flavour of the most important differences. But if you're doing anything more complicated than the basics, I suggest you always assume that Vim-regex is going to be different from Perl-regex or Javascript-regex and consult something like the Vim Regex website.
If by "normal regex" you mean Perl-Compatible Regular Expressions (PCRE), then the Vim help provides a good summary of the differences between Vim's regexes and Perl's:
:help perl-patterns
Here's what it says as of Vim 7.2:
9. Compare with Perl patterns *perl-patterns*
Vim's regexes are most similar to Perl's, in terms of what you can do. The
difference between them is mostly just notation; here's a summary of where
they differ:
Capability                     in Vimspeak    in Perlspeak
----------------------------------------------------------------
force case insensitivity       \c             (?i)
force case sensitivity         \C             (?-i)
backref-less grouping          \%(atom\)      (?:atom)
conservative quantifiers       \{-n,m}        *?, +?, ??, {}?
0-width match                  atom\@=        (?=atom)
0-width non-match              atom\@!        (?!atom)
0-width preceding match        atom\@<=       (?<=atom)
0-width preceding non-match    atom\@<!       (?<!atom)
match without retry            atom\@>        (?>atom)
Vim and Perl handle newline characters inside a string a bit differently:
In Perl, ^ and $ only match at the very beginning and end of the text,
by default, but you can set the 'm' flag, which lets them match at
embedded newlines as well. You can also set the 's' flag, which causes
a . to match newlines as well. (Both these flags can be changed inside
a pattern using the same syntax used for the i flag above, BTW.)
On the other hand, Vim's ^ and $ always match at embedded newlines, and
you get two separate atoms, \%^ and \%$, which only match at the very
start and end of the text, respectively. Vim solves the second problem
by giving you the \_ "modifier": put it in front of a . or a character
class, and they will match newlines as well.
Finally, these constructs are unique to Perl:
- execution of arbitrary code in the regex: (?{perl code})
- conditional expressions: (?(condition)true-expr|false-expr)
...and these are unique to Vim:
- changing the magic-ness of a pattern: \v \V \m \M
(very useful for avoiding backslashitis)
- sequence of optionally matching atoms: \%[atoms]
- \& (which is to \| what "and" is to "or"; it forces several branches
to match at one spot)
- matching lines/columns by number: \%5l \%5c \%5v
- setting the start and end of the match: \zs \ze
Try Vim's very magic regex mode. It behaves more like traditional regex, just prepend your pattern with \v. See :help /\v for more info. I love it.
There is a plugin called eregex.vim which translates from PCRE (Perl-compatible regular expressions) to Vim's syntax. It takes over a thousand lines of vim to achieve that translation! I guess it also serves as precise documentation of the differences.
Too broad question. Run vim and type :help pattern.

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex version supports it, you can use \L, like so with GNU sed (both the -r option and \L are GNU extensions):
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate [1], there's a good chance you can use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example [2], here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ not as escaping a special character to make it a literal, but as marking the beginning of a special character (as with \s, \w, \b and so forth). It may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
[1] I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
[2] I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this is the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*\.m4a)/$1\L$2\E$3/g' *
In Perl, there's
$string =~ tr/A-Z/a-z/;
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.
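A minimal sketch of the callback approach, using Python's re.sub, which accepts a function as the replacement. The pattern is adapted from the rename example above (with the dot before m4a escaped and an end anchor added), and the names lowercase_title and TITLE are made up for the illustration.

```python
import re

# Three groups: prefix, the part to lowercase, and the suffix.
TITLE = re.compile(r"^(.*_-_)(.*)(_-_.*\.m4a)$")


def lowercase_title(name):
    """Lowercase only the second captured group, leaving the rest intact."""
    return TITLE.sub(
        lambda m: m.group(1) + m.group(2).lower() + m.group(3),
        name,
    )


renamed = lowercase_title(
    "artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a"
)
print(renamed)
```

Because the callback receives the match object, it can apply any transformation to each group, which is why this works even in engines that have no \L in their replacement syntax.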