Regular expression explanation for vim - regex

If I want all the lines with the text 'ruby' but not 'myruby' then this is what I would do.
:g/\<ruby\>/
My question is: what do the less-than and greater-than symbols mean here? The only regular expressions I have used are the ones in Ruby.
Similarly if I want to find three consecutive blank lines then this is what I would do
/^\n\{3}
My question is why I am escaping the first curly brace ( opening curly brace ) but not escaping the second curly brace ( closing curly brace )?

Vim's rules for backslash-escaping in regexes are not consistent. You have to escape the opening brace of \{...}, but [...] requires no escaping at all, and a capture group is \(...\) (escaping both open and close paren). There are other inconsistencies as well.
Thankfully Vim lets you change this behavior, even on a regex-by-regex basis, via the magic settings. If you put \v at the beginning of a regex, the escaping rules become more consistent; everything is "magic" except numbers, letters, and underscores, so you don't need backslashes unless you want to insert a literal character other than those.
Your first example then becomes :g/\v<ruby>/ and your second example becomes /\v^\n{3}. See :h /magic and :h /\v for more information.

The \< and \> mean word boundaries (start of word and end of word, respectively). In Perl, grep and less (to name three off the top of my head) you use \b for this, so I imagine it's the same in Ruby.
Regarding your 2nd question, the escape is needed for the whole expression {3}. You're not escaping each curly brace, but rather the whole thing together.
See this question for more.
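As a rough cross-check outside Vim, here is a minimal Python sketch of the same whole-word idea. It assumes Python's re module, where \b plays the role of Vim's \< and \>; the sample lines are made up for illustration.

```python
import re

# Hypothetical sample lines; only the first two contain "ruby" as a whole word.
lines = ["I like ruby", "ruby is fun", "this is myruby", "rubyist here"]

# \b is a word boundary in Python, standing in for Vim's \< and \>.
whole_word = re.compile(r"\bruby\b")

matching = [line for line in lines if whole_word.search(line)]
print(matching)
```

The lines containing myruby and rubyist are skipped because there is no word boundary on one side of ruby.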

For your first regular expression, you could also do:
:g/ ruby /
This requires a literal space both before and after ruby, so unlike \<ruby\> it will miss ruby at the start or end of a line or next to punctuation.

Related

Why is only ) a special character and not } or ]?

I'm reading Jan Goyvaerts' "Regular Expressions: The Complete Tutorial and Reference" to touch up on my Regex.
In the second chapter, Jan has a section on "special characters:"
Special Characters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called “metacharacters”. Most of them are errors when used alone.
(emphasis mine)
I understand that only open square bracket and open curly brace are special since a close brace or bracket is clearly a literal if there's no preceding open. However, why does Jan specify that close parenthesis is a special character if the other two close's aren't?
Short answer
The regex flavors in my book do not require } and ] to be escaped (except for ] in character classes in JavaScript). So I don't because I like to have as few backslashes in my regexes as possible. You can escape them if you find your regexes clearer that way.
Full answer
First of all, anyone learning about regular expressions needs to understand the importance of the qualifier "In the regex flavors discussed in this tutorial..." You cannot discuss regular expressions without stating which regex flavor(s) you're talking about.
What I wrote is true for the flavors my book (2006 edition) discusses. In those flavors, ) is treated as a token that closes a group. It is a syntax error if used without a corresponding (. So ) has a special meaning when used all on its own.
} does not have a special meaning when used all on its own. You never need to escape it with these flavors. If you wanted to match something like {7} or {7,42} literally, you only need to escape the opening {. If you want to argue that } is special because it sometimes has a special meaning, then you would have to say the same about , which becomes special in the same situation.
] does not have a special meaning outside character classes in these regex flavors. You never need to escape it outside character classes. The paragraph you quoted does not talk about special characters inside character classes. That's a totally different list (\, ], ^, and -) discussed in a later chapter.
Now as to why: most regular expressions have plenty of backslashes already. My preferred style is to escape as few characters as needed. So I never escape }. I escape ] in character classes when using JavaScript because that's the only way. But with other flavors I place ] at the start of the character class or after the negating caret so I don't need to escape it. My teaching materials teach this style. When my products RegexBuddy or RegexMagic convert or generate regular expressions, they also use as few backslashes as needed.
I often see people new to regular expressions needlessly escape characters like ", ', or / because they need to be escaped when the regular expression is quoted as a source code literal in certain programming languages. But the regular expression itself does not require these to be escaped.
I even see people escape characters like < or >. This is a bad habit because in some regex flavors \< and \> are word boundaries. This includes recent versions of PCRE (but not the PCRE that was current in 2006).
But, if you find it confusing to see unescaped } and ] used as literals, you are free to escape them in your regexes. Except for < and >, all the flavors discussed in my book allow you to escape any punctuation character to match that character literally, even if the character on its own would be a literal already.
So somebody saying that } and ] are special characters in regular expressions is not wrong if "special characters" means "characters that have a special meaning either on their own or when used in combination with other characters". But that list would also include , (quantifier), : (non-capturing group), - (mode modifier), ! (negative lookaround), < (lookbehind), and - (character class range).
But if "special characters" means "characters that have a special meaning on their own", then } and ] are not included in the list for the flavors my book covers.
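The distinction can be probed in one concrete flavor. The sketch below uses Python's re module purely as a convenient example; the helper name compiles is made up, and other flavors (as noted above) may behave differently.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A lone closing bracket or brace is an ordinary literal in this flavor...
lone_bracket_ok = compiles("]")
lone_brace_ok = compiles("}")
# ...but a lone closing parenthesis is rejected as unbalanced.
lone_paren_ok = compiles(")")

# Escaping only the opening brace is enough to match "{7}" literally.
literal_quantifier = re.search(r"\{7}", "x{7}y")
print(lone_bracket_ok, lone_brace_ok, lone_paren_ok, literal_quantifier.group(0))
```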
The following paragraphs give an answer. I'm citing from Jan's website, not from the book, though:
If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.
Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like a{1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. But there are a few exceptions. Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.
] is a literal outside character classes. Different rules apply inside character classes. Those are discussed in the topic about character classes. Again, there are exceptions. std::regex and Ruby require closing square brackets to be escaped even outside character classes.
It seems like he uses "needs to be escaped" as his definition for "special character", and unlike ), the ] and } characters need not be escaped in most flavours.
That said, you wouldn't be wrong calling them special characters as well. It's arguably good practice to always escape them, and in no flavour do \] and \} mean anything other than a literal ] or }.
On the other hand, they have their special meaning only inside a specific (parsing) context, namely when they follow [ and { respectively. There are similar cases: :=><!#'&, all have a non-literal meaning inside a specific context, and we wouldn't normally call these "special characters" either.
And while we could say the same about ), almost no flavour allows for it to occur on its own outside of groups, because pairs of parentheses always need to match. Its only usage is in the special context, and therefore ) is considered a special character.
Everywhere in a regular expression, regardless of engine and its standards, a parenthesis should be escaped to mean a literal character. Even the closing parenthesis. However, this doesn't apply to POSIX regular expressions:
) The <right-parenthesis> shall be special when matched with a preceding <left-parenthesis>, both outside a bracket expression.
But the interesting part is that POSIX has a separate definition for a right-parenthesis for times it should be treated as a special character. It doesn't have it for } or ].
Why don't other engines follow this rule?
Call it implementation peculiarities or historical reasons that have something to do with Perl as commented in PCRE source code:
/* It appears that Perl allows any characters whatsoever, other than
a closing parenthesis, to appear in arguments, so we no longer insist on
letters, digits, and underscores. */
It seems that, with all the special clusters in more advanced engines, treating a closing parenthesis as a special character costs much less than implementing the POSIX standard.
From experiments, it appears that unlike ), the characters ] and } are only interpreted as delimiters when the corresponding opening [ or { has been met.
Though IMO the same rule could apply to ), that's the way it is.
This might be due to the way the parser was written: parentheses can be nested, so their balancing needs to be checked, whereas brackets and curly braces are just flagged. (For instance, [[] is a valid class definition. [[]] is also a valid pattern but understood as [\[]\].)

Can parentheses be used for order of operations in Regex?

I think I have my Regex working, but I'm not completely sure.
The regex is:
/(\n([ \t]*)){2,}/
The goal originally is to capture two or more new-lines together, so if someone types \n\n\n\n\n, I can do something with that.
However, I don't want interference between the consecutive new lines like trailing spaces and tabs...
So I still want to be able to catch \n \t \n\n \n.
I'm not sure if the parentheses are overkill.
The outer parentheses are to signify that I want everything inside:
(\n([ \t]*))
to happen two or more times.
Then, the inner parentheses:
([ \t]*)
are to signify that I want any combination of spaces and tabs ranging from none to infinity trailing a \n to be included in that group. The reason for the inner parentheses is because I don't want it to be interpreted as (\n[ \t])* where the \n is grouped into potentially happening zero to infinity times.
My confusion stems from the fact that parentheses are used for certain things in regex, right? Not sure if it is like math.
Disclaimer: "Regex" is not a single thing; rather, it's a family of related notations supported by many different languages and tools. The below explanation pertains to the most widespread forms of regex, such as those of Perl, Java, JavaScript, Python, and PHP.
Yes, parentheses result in grouping, just as in mathematics.
In addition, parentheses normally "capture" the text they match, allowing the text to be referred to later. For example, /([a-z])\1/ matches a lowercase ASCII letter, followed by the same letter again. (So, it matches ee, but not ef.) You can disable this capturing by writing (?:...) instead of just (...).
However, just as in mathematics, you don't always need parentheses, because sometimes the default "order of operations" is appropriate. Just as we don't usually write (2x) + 3, because it's equivalent to 2x + 3, we don't usually write \n([ \t]*), because it's equivalent to \n[ \t]*.
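A short Python sketch of the points above (Python being one of the flavors listed in the disclaimer): the backreference example, and the claim that the extra parentheses change capturing but not what is matched. The sample string is made up.

```python
import re

# ([a-z])\1 : a lowercase letter followed by the same letter again.
doubled = re.compile(r"([a-z])\1")
matches_ee = doubled.search("ee") is not None
matches_ef = doubled.search("ef") is not None

# Three spellings of "newline, then any run of spaces and tabs".
capturing = re.compile(r"\n([ \t]*)")
non_capturing = re.compile(r"\n(?:[ \t]*)")
plain = re.compile(r"\n[ \t]*")

sample = "a\n \t b"
spans = [p.search(sample).span() for p in (capturing, non_capturing, plain)]
group_counts = [p.groups for p in (capturing, non_capturing, plain)]
print(matches_ee, matches_ef, spans, group_counts)
```

All three patterns match the same span; only the number of capture groups differs.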
The inner parens are not necessary. The Kleene star applies only to the element immediately before it. In this case that is [ \t], not \n[ \t]. Note that in a regexp every single non-special character is one match operation. Only when you need multiple characters to be counted as a single match operation do you need to use parens.
So, if you want to do "match newline followed by zero or more whitespace" you do:
\n[ \t]*
But if you want to do "match zero or more repetitions of a newline followed by one whitespace character" you do:
(\n[ \t])*
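The difference in quantifier scope can be checked with a small Python sketch. The test strings are made up, and (?:...) is used instead of (...) only to avoid capturing.

```python
import re

# * binds only to [ \t]: one newline, then any run of spaces and tabs.
tight = re.match(r"\n[ \t]*", "\n \n")

# Grouped: the whole "newline + one space or tab" unit repeats.
grouped = re.match(r"(?:\n[ \t])*", "\n \n\tx")

# The pattern from the question, without the redundant inner group.
question = re.search(r"(?:\n[ \t]*){2,}", "a\n \t \n\n \nb")

print(repr(tight.group(0)), repr(grouped.group(0)), repr(question.group(0)))
```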

Why is the closing bracket a special character that must be escaped to be taken as a literal?

It is clear that an opening bracket "(", among other characters, must be escaped (prefixed by a backslash) for the regex to contain a "literal opening bracket": Because there are regex options for which "(" is a lead-in.
But how come the same holds true for the closing bracket ")"? There is no syntax construct that has ")" as a lead-in token, is there?
So why do I have to escape closing brackets for them to be taken literally?
Of course, the same question could be asked for the other closing brackets as well.
Sorry for this being a "why is this so?" question. It might possibly be un-answerable. But if there is a good reason, the only way to get to know it is by asking!
Addendum:
The rationale behind this question is:
For example, http://www.regexguru.com/2008/12/dont-escape-literal-characters-that-arent-metacharacters/ gives good reasons not to prefix characters that don't need prefixing.
And imho, the closing bracket does not need prefixing in most cases:
Since a closing bracket without an opening one is not part of a regex group, I find it totally illogical that it needs to be escaped in this case anyway.
Assume you want to match a group holding a closing bracket. Without escaping, this would look like this ()). Escaping the bracket like (\)) makes it much easier (if not even possible) for the regular expression to be parsed correctly and unambiguously.
In the (unescaped) regular expression (\w)), does the closing bracket belong to the group, or not, i.e., is the group closed by the first or the second )? E.g., for the string abc)d, does it match c or c)?
Of course one could omit some of the escape characters in case the meaning is not ambiguous (and the regex parser allows to do so) but what would it help? You save a character here and there, but each time you encounter a ) or another special character you have to think: "Is this a control character or a character to be matched? Is it ambiguous?" Better make it clear and consistent.
As a more specific example for tobias_k's answer:
Look at the following regex:
(a*))
Looking at the string bbaaa)bb, will it capture aaa or aaa)?
The result is clear with
(a*\))
versus
(a*)\)
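To make the two readings concrete, here is a small Python sketch that runs both escaped variants against the string from the example and compares what group 1 captures.

```python
import re

text = "bbaaa)bb"

# Escaped parenthesis inside the group: the ")" is part of the capture.
inside = re.search(r"(a*\))", text)
# Escaped parenthesis after the group: the ")" is matched but not captured.
outside = re.search(r"(a*)\)", text)

print(inside.group(1), outside.group(1), outside.group(0))
```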
Of course, the same question could be asked for the other closing brackets as well.
No, that's not correct (or at least it varies from one regex engine to another).
In the JavaScript regex engine, ] and } don't need to be escaped.
See this example:
var x = 'brackets)}]';
x.match(/]/); // works
x.match(/}/); // works
x.match(/)/); // fails
Only the third case fails, with the error Unmatched ')'

Regexp Question - Negating a captured character

I'm looking for a regular expression that allows for either single-quoted or double-quoted strings, and allows the opposite quote character within the string. For example, the following would both be legal strings:
"hello 'there' world"
'hello "there" world'
The regexp I'm using uses negative lookahead and is as follows:
(['"])(?:(?!\1).)*\1
This would work I think, but what about if the language didn't support negative lookahead. Is there any other way to do this? Without alternation?
EDIT:
I know I can use alternation. This was more of just a hypothetical question. Say I had 20 different characters in the initial character class. I wouldn't want to write out 20 different alternations. I'm trying to actually negate the captured character, without using lookahead, lookbehind, or alternation.
This is actually much simpler than you may have realized. You don't really need the negative look-ahead. What you want to do is a non-greedy (or lazy) match like this:
(['"]).*?\1
The ? character after the .* is the important part. It says, consume the minimum possible characters before hitting the next part of the regex. So, you get either kind of quote, and then you go after 0-M characters until you encounter a character matching whichever quote you first ran into. You can learn more about greedy matching vs. non-greedy here and here.
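A minimal Python sketch comparing the lazy pattern with the lookahead version from the question, using the two sample strings given there. It only checks these inputs; it does not show the two patterns are equivalent in general.

```python
import re

samples = [
    "\"hello 'there' world\"",
    "'hello \"there\" world'",
]

# Lazy version: take as few characters as possible before the closing quote.
lazy = re.compile(r"""(['"]).*?\1""")
# Version from the question: negative lookahead before every character.
tempered = re.compile(r"""(['"])(?:(?!\1).)*\1""")

lazy_results = [lazy.search(s).group(0) for s in samples]
tempered_results = [tempered.search(s).group(0) for s in samples]
print(lazy_results == tempered_results == samples)
```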
Sure:
'([^']*)'|"([^"]*)"
On a successful match, the $+ variable will hold the contents of whichever alternate matched.
In the general case, regexps are not really the answer. You might be interested in something like Text::ParseWords, which tokenizes text, accounting for nested quotes, backslashed quotes, backslashed spaces, and other oddities.
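For readers not using Perl, the alternation can be sketched in Python. Treating m.lastindex as a stand-in for Perl's $+ is an assumption that holds for this flat two-group pattern; the helper name contents is made up.

```python
import re

# One alternative per quote style; each has its own capture group.
either = re.compile(r"'([^']*)'" + "|" + r'"([^"]*)"')

def contents(match):
    """Return the text captured by whichever alternative matched."""
    return match.group(match.lastindex)

double_quoted = contents(either.search('say "hi there"'))
single_quoted = contents(either.search("say 'hi \"you\"'"))
print(double_quoted, single_quoted)
```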

Regex for matching a character, but not when it's enclosed in quotes

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.
So the following should have 2 matches
something:'firstValue':'secondValue'
something:"firstValue":'secondValue'
but this should only have 1 match
something:'no:match'
If the regular expression implementation supports look-around assertions, try this:
:(?:(?<=["']:)|(?=["']))
This will match any colon that is either preceded or followed by a double or single quote. So it only considers constructs like the ones you mentioned; something:firstValue would not be matched.
It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
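Such a parser can be quite small. The following Python sketch is one possible shape, assuming quotes are never escaped and that a quote of one kind is an ordinary character inside the other kind; the function name unquoted_colons is made up.

```python
def unquoted_colons(text: str) -> list:
    """Return the indexes of colons that are not inside '...' or "..."."""
    positions = []
    open_quote = None  # the quote character currently open, if any
    for index, char in enumerate(text):
        if open_quote is None:
            if char in "'\"":
                open_quote = char          # entering a quoted section
            elif char == ":":
                positions.append(index)    # colon outside any quotes
        elif char == open_quote:
            open_quote = None              # matching quote closes the section
    return positions

print(len(unquoted_colons("something:'firstValue':'secondValue'")))
print(len(unquoted_colons("something:'no:match'")))
```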
Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)
Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:
$string =~ s/['"][^'"]*['"]//g;
my $match_count = $string =~ /:/g;
The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)
The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.
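For comparison, the two-step Perl approach above translates fairly directly to Python. This is a sketch with the same limitations as the original (it works on a copy and throws the quoted data away); the function name count_unquoted_colons is made up.

```python
import re

def count_unquoted_colons(text: str) -> int:
    # Step 1: delete every quote, non-quote run, quote sequence
    # (same character classes as the Perl substitution).
    stripped = re.sub(r"""['"][^'"]*['"]""", "", text)
    # Step 2: count the colons that are left.
    return stripped.count(":")

print(count_unquoted_colons("something:\"firstValue\":'secondValue'"))
print(count_unquoted_colons("something:'no:match'"))
```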
Oops, I missed the point. Forget the rest. It's quite hard to do this because regex is not good at counting balanced characters (the .NET implementation, for example, has an extension that can do it, but it's a bit complicated).
You can use negated character groups to do this.
[^'"]:[^'"]
You can further wrap the quotes in non-capturing groups.
(?:[^'"]):(?:[^'"])
Or you can use assertion.
(?<!['"]):(?!['"])
I've come up with the following slightly worrying construction:
(?<=^('[^']*')*("[^"]*")*[^'"]*):
It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:
'a":b':c::"':" (matches at positions 6, 8 and 9)
EDIT
Gumbo is right, using * within a look behind assertion is not allowed.
You can try to catch the strings within the quotes
/(?<q>'|")([\w ]+)(\k<q>)/m
The first group defines the allowed quote types; the second group takes word characters (letters, digits, underscore) and spaces.
The nice thing about this solution is that it only takes strings whose opening and closing quotes match.
Try it at regex101.com
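The same idea in Python syntax, as a sketch: Python spells the named group (?P<q>...) and the named backreference (?P=q), and the closing quote is left uncaptured here. The sample strings are made up.

```python
import re

# Group "q" remembers the opening quote; (?P=q) demands the same one to close.
quoted = re.compile(r"""(?P<q>['"])([\w ]+)(?P=q)""")

matched = quoted.search("say 'hello world' now")
mismatched = quoted.search("'hello world\"")

print(matched.group(2), mismatched)
```

The second search returns None because the string opens with a single quote but closes with a double quote.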