What's the difference between () and [] in a regex?

Let's say:
/(a|b)/ vs /[ab]/

There's not much difference in your above example (in most languages). The major difference is that the () version creates a group that can be backreferenced by \1 in the match (or, sometimes, $1). The [] version doesn't do this.
Also,
/(ab|cd)/ # matches 'ab' or 'cd'
/[abcd]/ # matches 'a', 'b', 'c' or 'd'

() in a regular expression are used for grouping, allowing you to apply operators to an entire expression rather than a single character. For instance, ab* refers to an a followed by any number of bs (for instance, a, ab, abb, etc.), while (ab)* refers to any number of repetitions of the sequence ab (for instance, the empty string, ab, abab, etc.). In many regular expression engines, () are also used for creating references that can be referred to after matching. For instance, in Ruby, after you execute "foo" =~ /f(o*)/, $1 will contain oo.
| in a regular expression indicates alternation; it means the expression before the bar, or the expression after it. You could match any digit with the expression 0|1|2|3|4|5|6|7|8|9. You will frequently see alternation wrapped in a set of parentheses for the purposes of grouping or capturing a sub-expression, but it is not required. You can use alternation on longer expressions as well, like foo|bar, to indicate either foo or bar.
You can express every regular expression (in the formal, theoretical sense, not the extended sense that many languages use), with just alternation |, kleene closure *, concatenation (just writing two expressions next to each other with nothing in between), and parentheses for grouping. But that would be rather inconvenient for complicated expressions, so several shorthands are commonly available. For instance, x? is just a shorthand for |x (that is, the empty string or x), while y+ is a shorthand for yy*.
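As a rough check of those two shorthands, here is a minimal Python sketch. The helper name same_language and the small test alphabets are my own choices for illustration; it simply compares two patterns on every short string rather than proving anything formally.

```python
import re
from itertools import product

def same_language(a, b, alphabet, max_len=4):
    """Return True if patterns a and b accept exactly the same strings
    over `alphabet`, for every string of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(a, s)) != bool(re.fullmatch(b, s)):
                return False
    return True

# x? written with an empty alternative, and y+ written as yy*
print(same_language(r"x?", r"(?:|x)", "xz"))
print(same_language(r"y+", r"yy*", "yz"))
```

Both calls should report that the shorthand and its expansion agree on all the strings tried.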
[] are basically a shorthand for the alternation | of all of the characters, or ranges of characters, within it. As I said, I could write 0|1|2|3|4|5|6|7|8|9, but it's much more convenient to write [0-9]. I can also write [a-zA-Z] to represent any letter. Note that while [] do provide grouping, they do not generally introduce a new reference that can be referred to later on; you would have to wrap them in parentheses for that, like ([a-zA-Z]).
So, your two example regular expressions are equivalent in what they match, but the (a|b) will set the first sub-match to the matching character, while [ab] will not create any references to sub-matches.
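To make the difference concrete, here is a small sketch using Python's re module (Python is my choice here, the question itself is language-agnostic). It shows that both forms match the same text, that only the parenthesized form records a sub-match, and that grouping changes what a quantifier applies to.

```python
import re

m_group = re.search(r"(a|b)", "xbz")
m_class = re.search(r"[ab]", "xbz")

# Both find the same character, but only (a|b) records a sub-match.
print(m_group.group(0), m_group.groups())   # b ('b',)
print(m_class.group(0), m_class.groups())   # b ()

# Grouping decides what * repeats: the single b, or the sequence ab.
print(re.fullmatch(r"ab*", "abb") is not None)      # True
print(re.fullmatch(r"(ab)*", "abab") is not None)   # True
print(re.fullmatch(r"(ab)*", "abb") is not None)    # False
```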

First, when speaking about regexes, it's often important to specify what sort of regexes you're talking about. There are several variations (such as the traditional POSIX regexes, Perl and Perl-compatible regexes (PCRE), etc.).
Assuming PCRE or something very similar, which is often the most common these days, there are three key differences:
Using parenthetical groups, you can check options consisting of more than one character. So /(a|b)/ might instead be /(abc|defg)/.
Parenthetical groups perform a capture operation so that you can extract the result (so that if it matched on "b", you can get "b" back and see that). /[ab]/ does not. The capture operation can be overridden by adding ?: like so: /(?:a|b)/
Even if you override the capture behavior of parentheses, the underlying implementation may still be faster for [] when you're checking single characters (although nothing says non-capturing (?:a|b) can't be optimized as a special case into [ab], but regex compilation may take ever so slightly longer).
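A quick way to see the capture versus non-capture distinction is to ask the engine how many groups a compiled pattern has. This is a sketch in Python's re module, which follows Perl-style syntax for (?:...); it says nothing about relative speed, which depends on the engine.

```python
import re

patterns = [r"(a|b)", r"(?:a|b)", r"[ab]"]

for p in patterns:
    compiled = re.compile(p)
    # .groups is the number of capturing groups in the pattern
    print(p, compiled.groups, compiled.search("xbz").group(0))
```

Only the first pattern should report one capturing group; all three find the same b.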

Related

^ and $ expressed in fundamental operations in regular expressions

I've read a book which states that the fundamental operations in regular expressions are concatenation, or (|), closure (*) and parentheses to override default precedence. Every other operation is just a shortcut for one or more fundamental operations.
For example, the (AB)+ shortcut is expanded to (AB)(AB)* and (AB)? to (ε | AB), where ε is the empty string. First of all, I looked up the ASCII table and I am not sure which character code is assigned to the empty string. Is it ASCII 0?
I'd like to figure out how to express the shortcuts ^ and $, as in ^AB or AB$, in the fundamental operations, but I am not sure how to do this. Can you help me work out how this is expressed in fundamentals?
Regular expressions, the way they are defined in mathematics, are actually string generators, not search patterns. They are used as a convenient notation for a certain class of sets of strings. (Those sets can contain an infinite number of strings, so enumerating all elements is not practical.)
In a programming context, regexes are usually used as flexible search patterns. In mathematical terms we're saying, "find a substring of the target string S that is an element of the set generated by regex R". This substring search is not part of the regex proper; it's like there's a loop around the actual regex engine that tries to match every possible substring against the regex (and stops when it finds a match).
In fundamental regex terms, it's like there's an implicit .* added before and after your pattern. When you look at it this way, ^ and $ simply prevent .* from being added at the beginning/end of the regex.
As an aside, regexes (as commonly used in programming) are not actually "regular" in the mathematical sense; i.e. there are many constructs that cannot be translated to the fundamental operations listed above. These include backreferences (\1, \2, ...), word boundaries (\b, \<, \>), look-ahead/look-behind assertions ((?= ), (?! ), (?<= ), (?<! )), and others.
As for ε: It has no character code because the empty string is a string, not a character. Specifically, a string is a sequence of characters, and the empty string contains no characters.
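The "implicit .*" view can be sketched in Python, where re.search plays the role of the substring search and re.fullmatch tests membership of the whole string. The sample string is made up, and I assume a single-line target so that . and $ behave simply.

```python
import re

target = "xxAByy"  # hypothetical sample string without newlines

def found(pattern):
    """Substring search, as most programming tools do by default."""
    return re.search(pattern, target) is not None

def whole(pattern):
    """Membership test: the entire string must be generated by the pattern."""
    return re.fullmatch(pattern, target) is not None

# Unanchored search behaves like .* on both sides.
print(found(r"AB"), whole(r".*AB.*"))
# ^ drops the leading .* and $ drops the trailing one.
print(found(r"^AB"), whole(r"AB.*"))
print(found(r"AB$"), whole(r".*AB"))
```

Each line should print a pair of equal values: the anchored search and the corresponding whole-string pattern agree.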
^AB can be expressed as (εAB), i.e. an empty string followed by AB, and AB$ can be expressed as (ABε), that is, AB followed by an empty string.
The empty string is actually defined as '', a string of length 0, so it has no value in the ASCII table. However, the C programming language terminates all strings with the ASCII NUL character; although this is not counted in the length of the string, it must still be accounted for when allocating memory.
EDIT
As @melpomene pointed out in their comment, εAB is equivalent to AB, which makes the above invalid. Having talked to a work colleague, I'm no longer sure how to do this or even if it's possible. Hopefully someone can come up with an answer.

Regex character interval with exception

Say I have an interval of characters ['A'-'Z']. I want to match each of these characters except the letter 'F', and I need to do it through the ^ operator. Thus, I don't want to split it into two different intervals.
How can I do it the best way? I want to write something like ['A'-'Z']^'F' (All characters between A-Z except the letter F). This site can be used as reference: http://regexr.com/
EDIT: The relation to ocaml is that I want to define a regular expression of a string literal in ocamllex that starts/ends with a doublequote ( " ) and takes allowed characters in a certain range. Therefore I want to exclude the doublequotes because it obviously ends the string. (I am not considering escaped characters for the moment)
Since it is very rare to find two regular expressions libraries / processors with exactly the same regular expression syntax, it is important to always specify precisely which system you are using.
The tags in the question lead me to believe that you might be using ocamllex to build a scanner. In that case, according to the documentation for its regular expression syntax, you could use
['A'-'Z'] # 'F'
That's loosely based on the syntax used in flex:
[A-Z]{-}[F]
Java and Ruby regular expressions include a similar operator with very different syntax:
[A-Z&&[^F]]
If you are using a regular expression library which includes negative lookahead assertions (Perl, Python, Ecmascript/C++, and others), you could use one of those:
(?!F)[A-Z]
Or you could use a positive lookahead assertion combined with a negated character class:
(?=[A-Z])[^F]
In this simple case, both of those constructions effectively do a conjunction, but lookaround assertions are not really conjunctions. For a regular expression system which does implement a conjunction operator, see, for example, Ragel.
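For engines without a set-difference operator, the two lookahead forms can be checked against the hand-split class. This sketch uses Python's re module and tests single characters only; the helper name accepted is mine.

```python
import re
import string

direct = re.compile(r"[A-EG-Z]")
neg_lookahead = re.compile(r"(?!F)[A-Z]")
pos_lookahead = re.compile(r"(?=[A-Z])[^F]")

def accepted(pattern):
    """Set of printable ASCII characters the pattern matches as a whole string."""
    return {c for c in string.printable if pattern.fullmatch(c)}

# All three should accept the same 25 letters, with F excluded.
print(accepted(direct) == accepted(neg_lookahead) == accepted(pos_lookahead))
print(len(accepted(direct)), "F" in accepted(direct))
```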
The ocamllex syntax for character set difference is:
['A'-'Z'] # 'F'
which is equivalent to
['A'-'E' 'G'-'Z']
(?!F)[A-Z] or ((?!F)[A-Z])*
This will match every uppercase character excluding 'F'
Use character class subtraction:
[A-Z&&[^F]]
The alternative of [A-EG-Z] is "OK" for a single exception, but breaks down quickly when there are many exceptions. Consider this succinct expression for consonants (non-vowels):
[B-Z&&[^EIOU]]
vs this train wreck
[B-DF-HJ-NP-TV-Z]
The regex below accomplishes what you want using ^ and without splitting into different intervals. It also resembles your original thought (['A'-'Z']^'F').
/(?=[A-Z])[^F]/ig
If only uppercase letters are allowed, simply remove the i flag.

Do not include the condition itself in regex

Here's the regexp:
/\.([^\.]*)/g
But for string name.ns1.ns2 it catches .ns1 and .ns2 values (which does make perfect sense). Is it possible only to get ns1 and ns2 results? Maybe using assertions, nuh?
You have the capturing group, use its value, however you do it in your language.
JavaScript example:
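var list = Array.from("name.ns1.ns2".matchAll(/\.([^.]+)/g), function (m) { return m[1]; });
// list now contains 'ns1' and 'ns2' (match() with the g flag would return '.ns1' and '.ns2')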
If you can use lookbehinds (most modern regex flavors; in JavaScript only since ES2018), you can use this expression:
(?<=\.)[^.]+
In Perl you can also use \K like so:
\.\K[^.]+
I'm not 100% sure what you're trying to do, but let's go through some options.
Your regex: /\.([^\.]*)/g
(Minor note: you don't need the backslash in front of the . inside a character class [..], because a . loses its special meaning there already.)
First: matching against a regular expression is, in principle, a Boolean test: "does this string match this regex". Any additional information you might be able to get about what part of the string matched what part of the regex, etc., is entirely dependent upon the particular implementation surrounding the regular expression in whatever environment you're using. So, your question is inherently implementation-dependent.
However, in the most common case, a match attempt does provide additional data. You almost always get the substring that matched the entire regular expression (in Perl 5, it shows up in the $& variable). In Perl 5-compatible regular expressions, if you surround part of the regular expression with unquoted parentheses, you will additionally get the substrings that matched each set of those as well (in Perl 5, they are placed in $1, $2, etc.).
So, as written, your regular expression will usually make two separate results available to you: ".ns1", ".ns2", etc. for the entire match, and "ns1", "ns2", etc. for the subgroup match. You shouldn't have to change the expression to get the latter values; just change how you access the results of the match.
However, if you want, and if your regular expression engine supports them, you can use certain features to make sure that the entire regular expression matches only the part you want. One such mechanism is lookbehind. A positive lookbehind will only match after something that matches the lookbehind expression:
/(?<=\.)([^.]*)/
That will match any sequence of non-periods but only if they come after a period.
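Both routes can be sketched in Python, assuming that is close enough to your environment: read the capturing group from the original expression, or use the lookbehind so that the whole match is already the part you want.

```python
import re

text = "name.ns1.ns2"

# Whole match of the original expression includes the period.
whole_matches = [m.group(0) for m in re.finditer(r"\.([^.]*)", text)]
# Group 1 of the same expression does not.
group_matches = [m.group(1) for m in re.finditer(r"\.([^.]*)", text)]
# With a lookbehind, the period is required but not part of the match.
lookbehind_matches = re.findall(r"(?<=\.)[^.]+", text)

print(whole_matches)       # ['.ns1', '.ns2']
print(group_matches)       # ['ns1', 'ns2']
print(lookbehind_matches)  # ['ns1', 'ns2']
```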
Can you use something like string splitting, which allows you to break a string into pieces around a particular string (such as a period)?
It's not clear what language you're using, but nearly every modern language provides a way to split up a string. e.g., this pseudo code:
string myString = "bill.the.pony";
string[] brokenString = myString.split(".");

Different regex evaluation in collections or patterns

I am experiencing a strange behaviour when searching for a regular expression in vim:
I attempt to clean up superfluous whitespace in a file and want to use the substitute command for it.
When I use the following regular expression with collections, vim matches single whitespaces as well:
\%[\s]\{2,}
When I use the same regular expression with patterns instead of collections vim correctly matches only 2 or more whitespaces:
\%(\s\)\{2,}
I know that I do not need to use a collection, but if I try the expression in an online regular expression parser (e.g. Rubular) it works with a collection as well.
Can anyone explain why these expressions are not evaluated in the same way?
Because \%[...] and \%(...\) are completely different patterns.
\%[...] means a sequence of optional atoms.
For example, r\%[ead] matches "read", "rea", "re" and "r".
While \%(...\) treats the enclosed atoms as a single atom.
For example, r\%(ead\) matches only "read".
So that,
\%[\s]\{2,} can be interpreted as \(\s\|\)\{2,}, then \(\s\|\)\(\s\|\)\|\(\s\|\)\(\s\|\)\(\s\|\)\|....
Here \(\s\|\)\(\s\|\), the minimum pattern, can be interpreted as \(\)\(\), \(\)\(\s\), \(\s\)\(\) or \(\s\)\(\s\).
It matches 1 whitespace character too.
\%(\s\)\{2,} can be interpreted as \s\{2,}, then \s\s\|\s\s\s\|....
It matches only 2 or more whitespace characters.
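The same effect can be reproduced outside Vim. In this Python sketch, (?:\s|) is my assumed stand-in for \%[\s] (an atom that may match nothing) and (?:\s) stands in for \%(\s\); these are not Vim syntax, only an illustration of why the optional version accepts a single space.

```python
import re

# Assumed Python stand-ins for the two Vim patterns:
optional_atom = re.compile(r"(?:\s|){2,}")  # each repetition may be empty
plain_group = re.compile(r"(?:\s){2,}")     # each repetition needs whitespace

for text in [" ", "  ", "   "]:
    print(repr(text),
          optional_atom.fullmatch(text) is not None,
          plain_group.fullmatch(text) is not None)
```

For the single space, the first pattern should match (one repetition takes the space, the other takes nothing) while the second should not.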
Does this answer your question?
http://vimdoc.sourceforge.net/htmldoc/pattern.html#/\%[]
A sequence of optionally matched atoms. This always matches.
It matches as much of the list of atoms it contains as possible.
Thus it stops at the first atom that doesn't match.
For example:
/r\%[ead]
matches "r", "re", "rea" or "read". The longest that matches is used.
The problem is that \%[\s] always matches, even when it consumes nothing, so the {2,} quantifier after it can be satisfied by a single whitespace character.
It is rarely used, but interesting nevertheless.

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
, but this doesn't seem to work. If I have regex
/[a|b]./
, the string "abc" returns false with both my regex and the converse suggested by zijab,
/^((?!^[a|b].).)*$/
. Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression))
// do something...
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
      ^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
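For this particular pattern the inversion can be checked directly. The sketch below is in Python, which uses the "contains" semantics via re.search; the sample strings are arbitrary and contain no line separators.

```python
import re

original = re.compile(r"[ab].")
inverse = re.compile(r"^(?:(?![ab].).)*$")

samples = ["abc", "xyz", "a", "xa", "xay", ""]

for s in samples:
    contains = original.search(s) is not None
    inverted = inverse.search(s) is not None
    # For every sample, exactly one of the two should be True.
    print(repr(s), contains, inverted)
```

Note that "a" and "xa" do not match the original, because [ab] must be followed by another character, so the inverse accepts them.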
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
You want to match the literal foo. An expression, that does match anything else but a string that contains foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three characters long string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together this may work:
^([^f]|f(f|of)*o?[^fo])*(f(f|of)*o?)?$
But note: this applies only to foo.
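Hand-built complements like this are easy to get wrong, so it is worth checking one by brute force. The sketch below is in Python; the pattern NO_FOO is one candidate I derived from a three-state automaton (nothing seen, seen f, seen fo), and the helper name disagreements is my own. It compares the pattern with a plain substring test on every string over a small alphabet.

```python
import re
from itertools import product

# Candidate lookahead-free pattern for "does not contain foo" (an assumption,
# derived by hand; fullmatch is used instead of ^...$).
NO_FOO = re.compile(r"(?:[^f]|f(?:f|of)*o?[^fo])*(?:f(?:f|of)*o?)?")

def disagreements(alphabet="fox", max_len=6):
    """Strings where the regex and the substring test give different answers."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (NO_FOO.fullmatch(s) is not None) != ("foo" not in s):
                bad.append(s)
    return bad

print(disagreements())
```

An empty list means the pattern agreed with the substring test on every string tried; it is a sanity check, not a proof.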
You can also do this (in Python) by using re.split: splitting on your regular expression returns all the parts that don't match it.
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java regexes have an interesting way of doing this: create a possessive optional match (?+) for the string you want to exclude, and then match data after it. If the optional match fails, that doesn't matter because it's optional; if it succeeds, the possessive quantifier won't give the text back, so the second expression needs some extra data to match and otherwise fails.
It looks counter-intuitive, but works.
E.g. (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but I couldn't get it to work myself (they seem more willing to backtrack if the second match fails).