Do regex implementations actually need a split() function?

Is there any application for a regex split() operation that could not be performed by a single match() (or search(), findall() etc.) operation?
For example, instead of doing
subject.split('[|]')
you could get the same result with a call to
subject.findall('[^|]*')
And in nearly all regex engines (except .NET and JGSoft), split() can't do some things like "split on | unless they are escaped \|" because you'd need to have unlimited repetition inside lookbehind.
So instead of having to do something quite unreadable like this (nested lookbehinds!)
splitArray = Regex.Split(subjectString, @"(?<=(?<!\\)(?:\\\\)*)\|");
you can simply do (even in JavaScript, which did not support any kind of lookbehind before ES2018)
result = subject.match(/(?:\\.|[^|])*/g);
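For comparison, here is a minimal Python sketch of the same idea. The sample string and variable names are made up for illustration, and `+` is used instead of `*` so that `re.findall` does not report empty matches between fields (which also means genuinely empty fields are dropped):

```python
import re

# Hypothetical sample input: the second pipe is escaped with a backslash.
subject = r"a|b\|c|d"

# Each field is a run of "backslash plus any character" or "any character except a pipe".
fields = re.findall(r"(?:\\.|[^|])+", subject)
print(fields)
```

The escaped pipe stays inside its field because the `\\.` alternative is tried before `[^|]` and consumes the backslash together with the character after it.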
This has led me to wonder: Is there anything at all that I can do in a split() that's impossible to achieve with a single match()/findall() instead? I'm willing to bet there isn't, but I'm probably overlooking something.
(I'm defining "regex" in the modern, non-regular sense, i.e., using everything that modern regexes have at their disposal like backreferences and lookaround.)

The purpose of regular expressions is to describe the syntax of a language. These regular expressions can then be used to find strings that match the syntax of these languages. That’s it.
What you actually do with the matches depends on your needs. If you’re looking for all matches, repeat the find process and collect the matches. If you want to split the string, repeat the find process and split the input string at the positions where the matches were found.
So basically, regular expression libraries can only do one thing: search for a match. Everything else is just an extension built on top of that.
A good example of this is JavaScript, where RegExp.prototype.exec actually performs the match search. Any other method that accepts a regular expression (e.g. RegExp.prototype.test, String.prototype.match, String.prototype.search) just uses the basic functionality of RegExp.prototype.exec somehow:
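// pseudo-implementations (simplified: they ignore the g/y flags and lastIndex)
RegExp.prototype.test = function(str) {
    return this.exec(str) !== null;          // true if a match was found
};
String.prototype.match = function(pattern) {
    return new RegExp(pattern).exec(this);   // match array, or null
};
String.prototype.search = function(pattern) {
    var m = new RegExp(pattern).exec(this);
    return m ? m.index : -1;                 // index of the first match, or -1
};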
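The claim that splitting is just repeated searching can be sketched in Python. This is an illustration only: `split_via_search` is a made-up helper name, it skips empty matches to stay simple, and it ignores capturing groups (which the real `re.split` would include in its output):

```python
import re

def split_via_search(pattern, text):
    """Split text on pattern using nothing but repeated match searches."""
    pieces = []
    last_end = 0
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # skip empty matches so this sketch stays simple
        pieces.append(text[last_end:m.start()])  # text between two matches
        last_end = m.end()
    pieces.append(text[last_end:])  # whatever follows the last match
    return pieces

print(split_via_search(r"[,;]", "a,b;c"))
```

For patterns that never match the empty string and contain no groups, this should agree with `re.split`.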

Related

Why doesn't regex support inverse matching?

Several sources linked below seem to indicate regex wasn't designed for inverse matching - why not?
Recently, while trying to put together an answer for a question about a regex to match everything that was left after a specific pattern, I encountered several issues that left me curious about the limitations of regex.
Suppose we have some string: a simple line of text. I have a regex [a-zA-Z]e that will match one letter, followed by an e. This matches 3 times, on le, ne, and te. What if I want to match everything except patterns that match the regex? Suppose I want to capture a simp, li, of, and xt., including spaces (line breaks optional.) I later learned this behavior is called inverse matching, and shortly after, that it's not something regex easily supports.
I've examined some resources, but couldn't find any concrete answer on why inverse matching isn't "good".
Negative lookaheads appear useful for determining if a matched string does not contain some specific string, and are in fact used in several answers as methods to achieve this behavior (or something similar) - but they seem designed to act as a way to disqualify matches, as opposed to capturing non-matching input.
Negative lookaheads apparently shouldn't try to do this and aren't good at it either, choosing to leave inverse matching to the language they're being used with.
My own attempt at inverse matching was pointed out to be situational and very fragile, and looks convoluted even to me. In the comments, Wiktor Stribizew mentioned that "[...] in Java, you can't write a regex that matches any text other than some multicharacter string. With capturing, something can be done, but it is inefficient[.]"
Capture groups (the other method I was considering) appear to have the potential to dramatically slow the regex in more than one language.
All of these seem to indicate regex wasn't designed for inverse pattern matching, but none of them makes the reasoning behind that obvious. Why wasn't regex designed with a built-in ability to perform inverse pattern matching?
While plain regex matching, as you pointed out, does not easily support the functionality you want, a regex split does support it easily. Consider the following two scripts, first in Java and then in Python:
import java.util.Arrays;
String input = "a simple line of text.";
String[] parts = input.split("[a-z]e");   // split on every "letter followed by e"
System.out.println(Arrays.toString(parts));
This prints:
[a simp,  li,  of , xt.]
In Python, we can try something very similar:
import re
inp = "a simple line of text."
parts = re.split(r'[a-z]e', inp)
print(parts)
This prints:
['a simp', ' li', ' of ', 'xt.']
The secret sauce that is missing in pure regex is parsing or iteration. A good programming language, such as the two above, will expose an API that iterates over an input string using a supplied pattern and rolls up the portions between the matches.
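If the positions of the non-matching stretches are needed as well, the same iteration can be written out by hand. The following is a rough sketch; `non_matching_spans` is a hypothetical helper, not part of the `re` module:

```python
import re

def non_matching_spans(pattern, text):
    """Yield (start, end, substring) for each stretch of text NOT matched by pattern."""
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            yield pos, m.start(), text[pos:m.start()]
        pos = max(pos, m.end())
    if pos < len(text):
        yield pos, len(text), text[pos:]

for span in non_matching_spans(r"[a-zA-Z]e", "a simple line of text."):
    print(span)
```

Unlike a plain split, this variant drops empty stretches and reports offsets into the original string.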

How to concisely regex match any portion of a unique string?

Context:
Say I have a set of strings that are all distinct, though they may share starting sequences, e.g. apple, banana, bpple, canana, applf.
How best would I use a regex to match on a string that can contain any left-starting subset of one of those strings? For example apple and banana would obviously match. So would banan, ba, bp and c. b and appl would be ambiguous (and therefore should not match).
Using generated character classes in dynamically-built regexes (slow and ugly), I can make a match engine for this. However, it's complicated to the point that when I try, I end up doing most of the matching logic in Python/pick-your-language and ditching regex altogether. Is there some succinct way to make this work with regular expressions?
The simplest way to do this might be to break out each possible string (apple, banana etc) into a list and match against each one in sequence, but curiosity and stubbornness make me wonder if there isn't some way to do it with regex alone/primarily.
TL;DR:
Is there a way, using regex, to match: if and only if the string supplied is a unique and left-starting substring of only one of a given set of strings?
Don't use regular expressions. You are asking for the leaves in a trie.
If you absolutely have to use regular expressions, then they could be built like this:
(a(p(p(le?)?)?)?|b(a(n(a(na?)?)?)?)? ...)
It is easy to write some code that constructs this, but you won't be able to find out what actually matched (e.g. the user enters 'app' - you probably want to know that this matches 'apple'). Also, modifying this to ensure that there is no more than one match is really cumbersome. The code that constructs the regex will be much more complicated than just creating a trie (in fact, you probably have to create something equivalent to a trie in order to create the regex you are asking for).
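As a rough illustration of the non-regex route, here is a small Python sketch. It uses a plain dictionary of prefixes as a stand-in for a real trie, and the function names are invented for this example:

```python
def build_prefix_index(words):
    """Map every prefix to the set of words starting with it (a flattened trie)."""
    index = {}
    for word in words:
        for i in range(1, len(word) + 1):
            index.setdefault(word[:i], set()).add(word)
    return index

def unique_match(index, query):
    """Return the single word that query is a prefix of, or None if zero or several."""
    candidates = index.get(query, set())
    return next(iter(candidates)) if len(candidates) == 1 else None

index = build_prefix_index(["apple", "banana", "bpple", "canana", "applf"])
print(unique_match(index, "ba"))    # banana
print(unique_match(index, "appl"))  # None (apple and applf both fit)
print(unique_match(index, "b"))     # None (banana and bpple both fit)
```

The lookup also tells you which word matched, which the generated regex cannot do without extra work.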

Do not include the condition itself in regex

Here's the regexp:
/\.([^\.]*)/g
But for string name.ns1.ns2 it catches .ns1 and .ns2 values (which does make perfect sense). Is it possible only to get ns1 and ns2 results? Maybe using assertions, nuh?
You have the capturing group, use its value, however you do it in your language.
JavaScript example:
var re = /\.([^.]+)/g, list = [], m;
while ((m = re.exec("name.ns1.ns2")) !== null) list.push(m[1]); // m[1] is the capturing group
// list now contains 'ns1' and 'ns2'
If you can use lookbehinds (most modern regex flavors; JavaScript only since ES2018), you can use this expression:
(?<=\.)[^.]+
In Perl you can also use \K like so:
\.\K[^.]+
I'm not 100% sure what you're trying to do, but let's go through some options.
Your regex: /\.([^\.]*)/g
(Minor note: you don't need the backslash in front of the . inside a character class [..], because a . loses its special meaning there already.)
First: matching against a regular expression is, in principle, a Boolean test: "does this string match this regex". Any additional information you might be able to get about what part of the string matched what part of the regex, etc., is entirely dependent upon the particular implementation surrounding the regular expression in whatever environment you're using. So, your question is inherently implementation-dependent.
However, in the most common case, a match attempt does provide additional data. You almost always get the substring that matched the entire regular expression (in Perl 5, it shows up in the $& variable). In Perl5-compatible regular expressions, if you surround part of the regular expression with unquoted parentheses, you will additionally get the substrings that matched each set of those as well (in Perl 5, they are placed in $1, $2, etc.).
So, as written, your regular expression will usually make two separate results available to you: ".ns1", ".ns2", etc. for the entire match, and "ns1", "ns2", etc. for the subgroup match. You shouldn't have to change the expression to get the latter values; just change how you access the results of the match.
However, if you want, and if your regular expression engine supports them, you can use certain features to make sure that the entire regular expression matches only the part you want. One such mechanism is lookbehind. A positive lookbehind will only match after something that matches the lookbehind expression:
/(?<=\.)([^.]*)/
That will match any sequence of non-periods but only if they come after a period.
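Both approaches can be tried side by side in Python. This is a small sketch using the string from the question; the lookbehind variant uses `+` rather than `*` here so that no empty matches are reported:

```python
import re

text = "name.ns1.ns2"

# Whole match (group 0) versus the capturing group (group 1) for each hit.
for m in re.finditer(r"\.([^.]*)", text):
    print(m.group(0), m.group(1))

# With exactly one capturing group, findall returns only the group contents.
print(re.findall(r"\.([^.]*)", text))

# A lookbehind keeps the period out of the match itself.
print(re.findall(r"(?<=\.)[^.]+", text))
```

In both cases the expression stays essentially the same; only the way the result is read changes.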
Can you use something like string splitting, which allows you to break a string into pieces around a particular string (such as a period)?
It's not clear what language you're using, but nearly every modern language provides a way to split up a string. e.g., this pseudo code:
string myString = "bill.the.pony";
string[] brokenString = myString.split(".");

emacs: Is it possible to match strings with balanced parens with emacs regex?

Something like this:
http://perl.plover.com/yak/regex/samples/slide083.html
In other words I want to match successfully on { { foo } { bar} } but not on { { foo } .
I see it's possible in perl, and in .NET. Is it possible in emacs regex?
No, so far Perl/PCRE and .NET are the only regex flavors that support arbitrary nesting (recursive patterns).
No, but if you have a particular use case to discuss, you'll often find that you don't need regexes. State machines that match parentheses are pretty simple to write in Lisp. Looking at the source of Paredit is a good place to start.
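To show how little is needed, here is a counter-based scanner sketched in Python rather than Emacs Lisp. The function name and parameters are made up for this example, and it only checks balance; it does not locate the matching regions:

```python
def braces_balanced(text, open_ch="{", close_ch="}"):
    """Return True if every open_ch has a matching close_ch, in order."""
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:      # a closer appeared before its opener
                return False
    return depth == 0

print(braces_balanced("{ { foo } { bar} }"))  # True
print(braces_balanced("{ { foo } "))          # False
```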
If you are still interested have a look at cexp.el.
It is just a hack but maybe serves your purpose.
You can search for combined regular and balanced expressions with cexp-search-forward.
The built-in re-search-forward is used for regular expressions and so its syntax rules apply. Balanced expressions can be matched with the additional syntax elements \!( and \!).
The most serious restriction is that balanced expressions may not occur in groups. So a construct like \!(^{ \(\!(^{.*}$\!)\)+ }$\!) does not work because of the group containing the inner balanced expression.
Nevertheless, one useful example is matching TeX-definitions like
\def\mdo#1{{\def\next{\relax}\def\tmp{#1}\ifx\next\tmp\else\def\next{#1\mdo}\expandafter}\next}
with combined expressions like
\\def\\[[:alpha:]]+\(#[0-9]\)*\!(^{.*}$\!)
The search via cexp-search-forward with the above cexp returns the limits for the following groups:
The beginning and the end of the full match
The limits of the match for the regular expression before the balanced expression, i.e. \def\mdo#1
The limits of the captured group in the first regular expression, i.e., #1
The limits of the balanced expression, i.e., {{\def\next{\relax}\def\tmp{#1}\ifx\next\tmp\else\def\next{#1\mdo}\expandafter}\next}

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.
Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234
The reason you haven't found anything is probably that this is a problem of serious complexity, given the number of combinations certain expressions allow. Some regular expressions even allow infinitely many matches.
Consider the following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.
For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches is unlimited.
One method for locating all the possibilities is to detect where in the regular expression there are choices. For each possible choice, generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method proceeds as follows:
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"
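The expansion above can be sketched in Python for a deliberately tiny subset: literal characters and simple bracketed sets such as [BC], with no ranges, negation, escapes, or quantifiers. The function name is borrowed from the walkthrough, and `itertools.product` stands in for the recursive rewriting, which yields the same combinations in this restricted case:

```python
import itertools
import re

def get_all_matches(pattern):
    """Expand a pattern made only of literal characters and simple [..] sets."""
    # Tokenize: either a bracketed set like [BC] or a single literal character.
    tokens = re.findall(r"\[([^\]]+)\]|(.)", pattern)
    # Each token contributes a string of alternatives (one character for a literal).
    choices = [set_chars if set_chars else literal for set_chars, literal in tokens]
    return ["".join(combo) for combo in itertools.product(*choices)]

print(get_all_matches("A[BC][DE]F"))
```

Anything outside that subset would need a real regex parser, so treat this as an illustration of the idea only.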
It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.
Well, you could convert the regular expression into an equivalent finite state machine (this is relatively simple and can be done algorithmically) and then recursively follow every possible path through that FSM, outputting the followed paths through the machine. It's neither very hard nor computationally intensive per output (you will normally get a HUGE amount of output, however). You should, however, take care to disallow potentially infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted.
A regular expression is intended to do nothing more than match a pattern. That being said, a regular expression will never 'list' anything, only match. If you want a list of all matches, I believe you will need to build it on your own.
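The path-following part might look like the Python sketch below. It assumes the state machine already exists: the transition table is hand-built for the made-up pattern A[BC]*Z, and the regex-to-FSM conversion itself is not shown. The names `enumerate_paths`, `transitions`, and `max_len` are illustrative choices:

```python
def enumerate_paths(transitions, start, accepting, max_len):
    """Depth-first walk over a state machine, collecting accepted strings up to max_len."""
    results = []

    def walk(state, prefix):
        if state in accepting:
            results.append(prefix)
        if len(prefix) >= max_len:
            return  # abort: maximum path length reached
        for symbol, next_state in transitions.get(state, {}).items():
            walk(next_state, prefix + symbol)

    walk(start, "")
    return results

# Hand-built machine for the hypothetical pattern A[BC]*Z.
dfa = {
    0: {"A": 1},
    1: {"B": 1, "C": 1, "Z": 2},
}
print(enumerate_paths(dfa, start=0, accepting={2}, max_len=4))
```

The length cap is what keeps the loop on state 1 from producing endless output.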
Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?
It may be possible to find some code that lists all possible matches for something as simple as what you are doing. But for most regular expressions you would not even want to attempt listing all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.
I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/