removing redundancy from regex with multiple possible delimiters - regex

I have a regex in which the same match criteria can apply to multiple delimiters. [], (), and <> are all valid. For example purposes it looks like this:
\[.\]|\(.\)|<.>
Is there some way to remove the redundancy from the above regex? The match criteria inside the delimiters is always the same, but the delimiters themselves may be different.

I'm guessing you're asking because
[[(<].[])>]
isn't exact enough, for obvious reasons.
It's always dangerous to answer, "No, there is no way," because it's hard to be sure one has checked every possible way. One must often come up with a solid proof to answer in such cases.
I'm not sure this is a strong-enough proof, or even a "proof" at all, but consider this (pseudo-)information-theory perspective:
The PCRE engine itself has no knowledge of any relation between the pairs of characters, [], (), and <>. Thus, the expression itself must contain that information, i.e. require at least the six characters []()<> to be present.
Not only that, but for the same reason, the expression itself must define at least two pairings (leaving the third to be implied). I'm not sure how to prove that two alternation operators (|) is the best you can do, but I mean, even if there were a more compact way, you're going to save one character at most, since at least one bit is required to say, "Pairings exist!"
The escaping of meta-characters can only be compacted by the fact that []() can appear within character classes without being escaped, but firstly, that isn't really a "removal of redundancy" as much as it is "a lucky circumstance in syntax", and secondly, you still have to add two characters for the definition of said character class: [].
Therefore, it is my belief that even from a theoretical perspective, if my presumptions about what a regex engine cannot know are true, then one can save at most three characters from the regex you've already provided: \[.\]|\(.\)|<.>.
I eagerly look forward to being corrected by the regex gurus!

If you really are using the PCRE library (via PHP, for example) you can use a DEFINE group to create a subroutine, like so:
'~(?(DEFINE)(?<content>\w+))(?:<(?&content)>|\[(?&content)\]|\((?&content)\))~'
...or more readably:
(?(DEFINE)(?<content>\w+))
(?:
<(?&content)>
|
\[(?&content)\]
|
\((?&content)\)
)
Here's a demo in PHP. It should work in Perl, too.

Related

Regex to match hexadecimal and integer numbers [duplicate]

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.
The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
all the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
the pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).
Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will. JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followd by line start", it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex, your version of regex must support (?(named group)true|false).

differentiating and testing regex variants

Several implementations of regular expressions differ from each other in subtle ways which is the source of much confusion when I try to use them.
Most of these differences include the semantics related to whether a character is escaped or not. This is most often an issue with parentheses, but can apply to curly brackets and others. This is probably a consequence of the syntax of the language or environment in which the implementation is found. For instance, if the $ symbol indicates a variable name in some language, one can expect regular expressions represented in that language would require escaping the "end of line" anchor to \$ or some such. But what gets confusing at this point is how you would represent an actual dollar sign. I believe Perl gets around this by wrapping a regex inside forward slashes /.
Similarly there are escapes for specific characters themselves, for instance non printing characters such as \n and \t. Then there are the similar looking generic character groups such as \d for digits, \s for whitespace, and \w which I just learned covers underscores as well as digits. I found myself on several occasions trying to use \a for a "alphabetical" group but this only ended up matching the bell character 0x07.
It's pretty clear that there is no simple and one-shot solution to knowing all of the differences in features and syntax offered by the myriad of implementations of regular expressions out there, short of somebody doing all the hard work and putting results in a well organized table. Here is one example of exactly this, but of course it doesn't cover several of the programs that I use extensively myself, which include vim, sed, Notepad++, Eclipse, and believe it or not MS Word (at least version 2010, I suspect 2007 also has this, they call it "wildcards") has a simple regex implementation too.
I guess what I want is to be as lazy as possible (in a certain sense) by trying to come up with a way to determine for any given regex implementation what its "escape settings" are beyond any doubt by applying one (or a few) queries.
I'm thinking I can make a file which contains test cases, along with a huge regex query, and somehow engineer it so that running it once will show me exactly what syntax I need to use subsequently without doubting myself any further. (as opposed to having to edit files and use multiple queries to figure out the same thing which gets terribly old after a while).
If nobody else has attempted to construct such a monstrosity, I may undertake this task myself. If it's even possible. Is this possible?
I tried to come up with an example (it was just to figure out if EOL anchor is $ or \$) but in every case I had to use a multitude of different search/replace queries in order to determine how the program will respond to the input.
Edit: I came up with something using capturing and backtracking. I gotta work on it a little more.
Update: Well, Notepad++ does not implement the OR operator commonly denoted by the pipe |. Word's "wildcards" is a poor substitute also, it doesn't have | or *. I'm fairly certain that missing any of the regular expression operators (union, concat, star) means it cannot generate a regular grammar, so those two are ruled out.
I can create an input file like this:
$
*
]
EOL
and query
(\$)|(\*)|(\[)|($)
replacing with
escDollar:\1:escStar:\2:escSQBrL:\3:Dollar:\4:
yields a result of (assuming unescaped parens is group and unescaped pipe is or)
escDollar:$:escStar::escSQBrL::Dollar::
escDollar::escStar:*:escSQBrL::Dollar::
]escDollar::escStar::escSQBrL::Dollar::
EOLescDollar::escStar::escSQBrL::Dollar::
I ran this in vim. This output would demonstrate the single characters that are matched by each item specified next to it, i.e. the escaped dollar sign item is seen to match the actual dollar sign character rather than the non escaped dollar sign item at the end.
It's difficult to see what's going on with the $ anchor since it matches zero characters, but it shouldn't be hard to find a solution for it. Besides it's not a commonly mistaken one. The ones I'm particularly worried about are pipe and parens and the different brackets. When you've got 4 different types in there there are 2^4 combinations of escaped and non-escaped versions of them you can use. Trial-and-error with that is horrific.
This output isn't too hard to parse at a glance, and is also seriously easy to process as part of a script. The one glaring problem that remains is figuring out whether parens and pipe need to be escaped. Because the functionality of the whole thing depends on them.
It would seem like that will require multiple queries. It may be possible with a cleverly engineered jumble of backslashes, parens, and pipes to figure out the combination (only 4 possibilities after all) with an initial query, then choose the subsequent matrix generator query based on it.
Something like this shows it can work:
(e)
(f)
querying
\((f\))|\|\((e\))
replace with
\1:\2
would produce:
:(e if escaped parens is group and escaped pipe is or
:e) if parens is group and escaped pipe is or
(f: if escaped parens is group and pipe is or
f): if parens is group and pipe is or
I still don't really like this though because it requires a second query on a second set of input. Too much setting up. I may just make 4 copies of the "matrix" thing.
The table on this page summarizes quite nicely which features are available in which regex implementations:
http://www.regular-expressions.info/refflavors.html

regex match upto some character

Conditions updated
There is often a situation where you want to extract a substring upto (immediately before) certain characters. For example, suppose you have a text that:
Does not start with a semicolon or a period,
Contains several sentences,
Does not contain any "\n", and
Ends with a period,
and you want to extract the sequence from the start upto the closest semicolon or period. Two strategies come to mind:
/[^;.]*/
/.*?[;.]/
I do either of these quite randomly, with slight preference to the second strategy, and also see both ways in other people's code. Which is the better way? Is there a clear reason to prefer one over the other, or are there better ways? I personally feel, efficiency aside, that negating something (as with [^]) is conceptually more complex than not doing it. But efficiency may also be a good reason to chose one over the other.
I came up with my answer. The two regexes in my question were actually not expressing the same thing. And the better approach depends on what you want.
If you want a match up to and including a certain character, then using
/.*?[;.]/
is simpler.
If you want a match up to right before (excluding) a certain character, then you should use:
/[^;.]*/
Well, the first way is probably more efficient, not that it's likely to matter. By the way, \z in a character class does not mean "end of input"--in fact, it's a syntax error in every flavor I know of. /[^;.]*/ is all you need anyway.
I personally prefer the first one because it does exactly as you would expect. Get all characters except ...
But it's mostly a matter of preference. There are nearly always multiple ways to write a regular expression and it's mostly style that matters.
For example... do you prefer [0-9], [:digit:] or \d? They all do exactly* the same.
* In case of unicode the [:digit:] and \d classes match some other characters too.
you left out one other strategy. string split?
"my sentence; blahblah".split(/[;.]/,2)[0]
I think that it is mostly a matter of opinion as to which regular expression you use. On the note of efficiency, though, I think that adding \A to the beginning of a regular expression in this case would make the process faster because well designed regular expression engines should only try to match once in that case. For example:
/\A[^.;]/m
Note the m option; it indicates that newline characters can also be matched. This is just a technicality I would add for generic examples, but may not apply to you.
Although adding more to the solution might be viewed as increasing complexity, it can also serve to clarify meaning.

boost::regex - \bb?

I have some badly commented legacy code here that makes use of boost::regex::perl. I was wondering about one particular construct before, but since the code worked (more or less), I was loath to touch it.
Now I have to touch it, for technical reasons (more precisely, current versions of Boost no longer accepting the construct), so I have to figure out what it does - or rather, was intended to do.
The relevant part of the regex:
(?<!(\bb\s|\bb|^[a-z]\s|^[a-z]))
The piece that gives me headaches is \bb. I know of \b, but I could not find mention of \bb, and looking for a literal 'b' would not make sense here. Is \bb some special underdocumented feature, or do I have to consider this a typo?
As Boost seems to be a regex engine for C++, and one of the compatibility modes is perl compatibility--if that is a "perl-compatible" expression, than the second 'b' can only be a literal.
It's a valid expression, pretty much a special case for words beginning with 'b'.
It seems to be the deciding factor that this is a c++ library, and that it's to give environments that aren't perl, perl-compatible regexes. Thus my original thought that perl might interpret the expression (say with overload::constant) is invalid. Yet it is still worth mentioning just for clarification purposes, regardless of how inadvisable it would be tweak an expression meaning "word beginning with 'b'".
The only caveat to that idea is that perhaps Boost out-performs Perl at it's own expressions and somebody would be using the Boost engine in a Perl environment, then all bets are off as to whether that could have been meant as a special expression. This is just one stab, given a grammar where '!!!' meant something special at the beginning of words, you could piggyback on the established meaning like this (NOT RECOMMENDED!)
s/\\bb\b/(?:!!!(\\p{Alpha})|\\bb)/
This would be something dumb to do, but as we are dealing with code that seems unfit for its task, there are thousands of ways to fail at a task.
(\bb\s|\bb|^[a-z]\s|^[a-z]) matches a b if it's not preceded by another word character, or any lowercase letter if it's at the beginning of the string. In either case, the letter may be followed by a whitespace character. (It could match uppercase letters too if case-insensitive mode is set, and the ^ could also match the beginning of a line if multiline mode is set.)
But inside a lookbehind, that shouldn't even have compiled. In some flavors, a lookbehind can contain multiple alternatives with different, fixed lengths, but the alternation has to be at the top level in the lookbehind. That is, (?<=abc|xy|12345) will work, but (?<=(abc|xy|12345)) won't. So your regex wouldn't work even in those flavors, but Boost's docs just say the lookbehind expression has to be fixed-length.
If you really need to account for all four of the possibilities matched by that regex, I suggest you split the lookbehind into two:
(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)

Regex: Is Lazy Worse?

I have always written regexes like this
([^<]*)
but I just learned about this lazy thing and that I can write it like this
(.*?)
is there any disadvantage to using this second approach? The regex is definitely more compact (even SO parses it better).
Edit: There are two best answers here, which point out two important differences between the expressions. ysth's answer points to a weakness in the non-greedy/lazy one, in which the hyperlink itself could possibly include other attributes of the A tag (definitely not good). Rob Kennedy points out a weakness in the greedy example, in that anchor texts cannot include other tags (definitely not okay, because it wouldn't grab all the anchor text either)... so the answer is that, regular expressions being what they are, lazy and non-lazy solutions that seem the same are probably not semantically equivalent.
Edit: Third best answer is by Alan M about relative speed of the expressions. For the time being, I'll mark his as best answer so people give him more points :)
Another thing to consider is how long the target text is, and how much of it is going to be matched by the quantified subexpression. For example, if you were trying to match the whole <BODY> element in a large HTML document, you might be tempted to use this regex:
/<BODY>.*?<\/BODY>/is
But that's going to do a whole lot of unnecessary work, matching one character at a time while effectively doing a negative lookahead before each one. You know the </BODY> tag is going to be very near the end of the document, so the smart thing to do is to use a normal greedy quantitier; let it slurp up the whole rest of the document and then backtrack the few characters necessary to match the end tag.
In most cases you won't notice any speed difference between greedy and reluctant quantifiers, but it's something to keep in mind. The main reason why you should be judicious in your use of reluctant quantifiers is the one that was pointed out by the others: they may do it reluctantly, but they will match more than you want them to if that's what it takes to achieve an overall match.
The complemented character class more rigorously defines what you want to match, so whenever you can, I'd use it.
The non greedy regex will match things you probably don't want, such as:
foo
where your first .*? matches
foo" NAME="foo
Note that your examples are not equivalent. Your first regular expression will not select any links that contain other tags, such as img or b. The second regular expression will, and I expect that's probably what you wanted anyway.
Besides the difference in meaning, the only disadvantage I can think of is that support for non-greedy modifiers isn't quite as prevalent as character-class negation is. It's more widely supported than I thought, before I checked, but notably absent from the list is GNU Grep. If the regular-expression evaluators you're using support it, then go ahead and use it.
It's not about better or worse. The term I've seen the most is greedy vs. non-greedy, but however you put they do two different things. You want to use the right one for the task. I.e. turn off the greedy option when you don't want to capture multiple matches in a line.
“lazy” is the wrong word here. You mean non-greedy as opposed to greedy. There's no disadvantage in using it, that I know of. But in your special case, neither should it be more efficient.
Non-greedy is better, is it not? It works forward, checking for a match each time and stopping when it finds one, whereas the normal kleene closure (*) works backwards matching the rest of the input and removing things until it finds a match.
In the end, they do different things, but I think non-greedy outperforms greedy. Bear in mind that I haven't tested this, but now I'm curious.