What's the difference between vim regex and normal regex? - regex

I noticed that vim's substitute regex is a bit different from other regexp. What's the difference between them?

"Regular expression" really defines algorithms, not a syntax. What that means is that different flavours of regular expressions will use different characters to mean the same thing; or they'll prefix some special characters with backslashes where others don't. They'll typically still work in the same way.
Once upon a time, POSIX defined the Basic Regular Expression syntax (BRE), which Vim largely follows. Very soon afterwards, an Extended Regular Expression (ERE) syntax proposal was also released. The main difference between the two is that BRE tends to treat more characters as literals - an "a" is an "a", but also a "(" is a "(", not a special character - and so involves more backslashes to give them "special" meaning.
The discussion of complex differences between Vim and Perl on a separate comment here is useful, but it's also worth mentioning a few of the simpler ways in which Vim regexes differ from the "accepted" norm (by which you probably mean Perl.) As mentioned above, they mostly differ in their use of a preceding backslash.
Here are some obvious examples:
Perl Vim Explanation
---------------------------
x? x\= Match 0 or 1 of x
x+ x\+ Match 1 or more of x
(xyz) \(xyz\) Use brackets to group matches
x{n,m} x\{n,m} Match n to m of x
x*? x\{-} Match 0 or 1 of x, non-greedy
x+? x\{-1,} Match 1 or more of x, non-greedy
\b \< \> Word boundaries
$n \n Backreferences for previously grouped matches
That gives you a flavour of the most important differences. But if you're doing anything more complicated than the basics, I suggest you always assume that Vim-regex is going to be different from Perl-regex or Javascript-regex and consult something like the Vim Regex website.

If by "normal regex" you mean Perl-Compatible Regular Expressions (PCRE), then the Vim help provides a good summary of the differences between Vim's regexes and Perl's:
:help perl-patterns
Here's what it says as of Vim 7.2:
9. Compare with Perl patterns *perl-patterns*
Vim's regexes are most similar to Perl's, in terms of what you can do. The
difference between them is mostly just notation; here's a summary of where
they differ:
Capability in Vimspeak in Perlspeak ~
----------------------------------------------------------------
force case insensitivity \c (?i)
force case sensitivity \C (?-i)
backref-less grouping \%(atom\) (?:atom)
conservative quantifiers \{-n,m} *?, +?, ??, {}?
0-width match atom\#= (?=atom)
0-width non-match atom\#! (?!atom)
0-width preceding match atom\#<= (?<=atom)
0-width preceding non-match atom\#<! (?!atom)
match without retry atom\#> (?>atom)
Vim and Perl handle newline characters inside a string a bit differently:
In Perl, ^ and $ only match at the very beginning and end of the text,
by default, but you can set the 'm' flag, which lets them match at
embedded newlines as well. You can also set the 's' flag, which causes
a . to match newlines as well. (Both these flags can be changed inside
a pattern using the same syntax used for the i flag above, BTW.)
On the other hand, Vim's ^ and $ always match at embedded newlines, and
you get two separate atoms, \%^ and \%$, which only match at the very
start and end of the text, respectively. Vim solves the second problem
by giving you the \_ "modifier": put it in front of a . or a character
class, and they will match newlines as well.
Finally, these constructs are unique to Perl:
- execution of arbitrary code in the regex: (?{perl code})
- conditional expressions: (?(condition)true-expr|false-expr)
...and these are unique to Vim:
- changing the magic-ness of a pattern: \v \V \m \M
(very useful for avoiding backslashitis)
- sequence of optionally matching atoms: \%[atoms]
- \& (which is to \| what "and" is to "or"; it forces several branches
to match at one spot)
- matching lines/columns by number: \%5l \%5c \%5v
- setting the start and end of the match: \zs \ze

Try Vim's very magic regex mode. It behaves more like traditional regex, just prepend your pattern with \v. See :help /\v for more info. I love it.

There is a plugin called eregex.vim which translates from PCRE (Perl-compatible regular expressions) to Vim's syntax. It takes over a thousand lines of vim to achieve that translation! I guess it also serves as precise documentation of the differences.

Too broad question. Run vim and type :help pattern.

Related

grep regex lookahead or start of string (or lookbehind or end of string)

I want to match a string which may contain a type of character before the match, or the match may begin at the beginning of the string (same for end of string).
For a minimal example, consider the text n.b., which I'd like to match either at the beginning of a line and end of a line or between two non-word characters, or some combination. The easiest way to do this would be to use word boundaries (\bn\.b\.\b), but that doesn't match; similar cases happen for other desired matches with non-word characters in them.
I'm currently using (^|[^\w])n\.b\.([^\w]|$), which works satisfactorily, but will also match the non-word characters (such as dashes) which appear immediately before and after the word, if available. I'm doing this in grep, so while I could easily pipe the output into sed, I'm using grep's --color option, which is disabled when piping into another command (for obvious reasons).
EDIT: The \K option (i.e. (\K^|[^\w])n\.b\.(\K[^\w]|$) seems to work, but it also does discard the color on the match within the output. While I could, again, invoke auxiliary tools, I'd love it if there was a quick and simple solution.
EDIT: I have misunderstood the \K operator; it simply removes all the text from the match preceding its use. No wonder it was failing to color the output.
If you're using grep, you must be using the -P option, or lookarounds and \K would throw errors. That means you also have negative lookarounds at your disposal. Here's a simpler version of your regex:
(?<!\w)n\.b\.(?!\w)
Also, be aware that (?<=...) and (?<!...) are lookbehinds, and (?=...) and (?!...) are lookaheads. The wording of your title suggests you may have gotten those mixed up, a common beginner's mistake.
Apparently matching beginning of string is possible inside lookahead/lookbehinds; the obvious solution is then (?<=^|[^\w])n\.b\.(?=[^\w]|$).

Basic Vim - Search and Replace text bounded by specific characters

Say I wanted to replace :
"Christoph Waltz" = "That's a Bingo";
"Gary Coleman" = "What are you talking about, dear Willis?";
to just have :
"Christoph Waltz"
"Gary Coleman"
i.e. I want to remove all the characters including and after the = and the ;
I thought the regex for finding the pattern would be \=.*?\;. In vim, I tried :
:%s/\=.*?\;$//g
but it gave me an Invalid Command error and Nothing after \=. How do I remove the above text? Apologies, but I'm new to this.
Vim's regular expression dialect is different; its escaping is optimized for text searches. See :help perl-patterns for a comparison with Perl regular expressions. As #EvergreenTree has noted, you can influence the escaping behavior with special atoms; cp. :help /\v.
For your example, the non-greedy match is .\{-}, not .*?, and, as mentioned, you mustn't escape several literal characters:
:%s/ =.\{-};$//
(The /g flag is superfluous, too; there can be only one match anchored to the end via $.)
This is because of vim's weird handling of regexes by default. Instead of \= interpreting as a literal =, it interprets it as a special regex character. To make vim's regex system work more normally, you can prefix it with \v, which is "very magic" mode. This should do the trick:
%s/\v\=.*\;$//g
I won't explain how vim enterprets every single character in very magic mode here, but you can read about it in this help topic:
:help /magic

How to highlight words beginning with ‘#’ in Vim syntax?

I have a very simple Vim syntax file for personal notes. I would like to highlight people's name and I chose a Twitter-like syntax #jonathan.
I tried:
syntax match notesPerson "\<#\S\+"
To mean: words beginning with # and having at least one non-whitespace character. The problem is that # seems to be a special character in Vim regular expressions.
I tried to escape \# and enclose in brackets [#], the usual tricks, but that didn't work. I could try something like (^|\s) (beginning of line or whitespace) but that's exactly the problem that word-boundary tries to solve.
Highlighting works on simplified regular expressions, so this is more a question of finding the right regex than anything else. What am I missing?
# is a special character only if you have enabled the “very magic”
mode by having \v somewhere in the pattern prior to that #.
You have another problem here: # does not start a new word. \< is
not just “word boundary” like perl/PCRE’s \b, but “left word
boundary” (in help: “beginning of the word”) meaning that \< must be
followed by some keyword character. As # is not normally a keyword
character, pattern \<# will never match. (And even if it was like
\b, it would match constructs like abc#def which is definitely not
what you want for the aforementioned reasons.)
You should use \k\#<!#\k\S* instead: \k\#<! ensures that # is not preceded by any keyword character, \k\S* makes sure that first character of the name is a keyword one (you could probably also use #\<\S\+).
There is another solution: include # into 'iskeyword' option and leave the regex as is:
:setlocal iskeyword+=#-#
See :help 'isfname' for the explanation why #-# is used here.
(The 'iskeyword' option has exactly the same syntax and will,
in fact, redirect you there for the explanation.)

Syntax highlighting for regular expressions in Vim

Whenever I look at regular expressions of any complexity, my eyes start to water. Is there any existing solution for giving different colors to the different kinds of symbols in a regex expression?
Ideally I'd like different highlighting for literal characters, escape sequences, class codes, anchors, modifiers, lookaheads, etc. Obviously the syntax changes slightly across languages, but that is a wrinkle to be dealt with later.
Bonus points if this can somehow coexist with the syntax highlighting Vim does for whatever language is using the regex.
Does this exist somewhere, or should I be out there making it?
Regular expressions might not be syntax-highlighted, but you can look into making them more readable by other means.
Some languages allow you to break regular expressions across multiple lines (perl, C#, Javascript). Once you do this, you can format it so it's more readable to ordinary eyes. Here's an example of what I mean.
You can also use the advanced (?x) syntax explained here in some languages. Here's an example:
(?x: # Find word-looking things without vowels, if they contain an "s"
\b # word boundary
[^b-df-hj-np-tv-z]* # nonvowels only (zero or more)
s # there must be an "s"
[^b-df-hj-np-tv-z]* # nonvowels only (zero or more)
\b # word boundary
)
EDIT:
As Al pointed out, you can also use string concatenation if all else fails. Here's an example:
regex = "" # Find word-looking things without vowels, if they contain an "s"
+ "\b" # word boundary
+ "[^b-df-hj-np-tv-z]*" # nonvowels only (zero or more)
+ "s" # there must be an "s"
+ "[^b-df-hj-np-tv-z]*" # nonvowels only (zero or more)
+ "\b"; # word boundary
This Vim plugin claims to do syntax higlighting:
http://www.vim.org/scripts/script.php?script_id=1091
I don't think it's exactly what you want, but I guess it's adaptable for your own use.
Vim already has syntax highlighting for perl regular expressions. Even if you don't know perl itself, you can still write your regex in perl (open a new buffer, set the filetype to perl and insert '/regex/') and the regex will work in many other languages such as PHP, Javascript or Python where they have used the PCRE library or copied Perl's syntax.
In a vimscript file, you can insert the following line of code to get syntax highlighting for regex:
let testvar =~ "\(foo\|bar\)"
You can play around with the regex in double-quotes until you have it working.
It is very difficult to write syntax highlighting for regex in some languages because the regex are written inside quoted strings (unlike Perl and Javascript where they are part of the syntax). To give you an idea, this syntax script for PHP does highlight regex inside double- and single-quoted strings, but the code to highlight just the regex is longer than most languages' entire syntax scripts.

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex version supports it, you can use \L, like so in a POSIX shell:
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate1, there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example2, here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as—instead of escaping a special character to make it a literal—marking the beginning of a special character (as with \s, \w, \b and so forth). So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
1 I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
2 I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*.m4a)/$1\L$2\E$3/g' *
In Perl, there's
$string =~ tr/[A-Z]/[a-z]/;
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.