Simple regex negation - regex

Trying to match ONLY the first character in the sample below.
Sample string: C/C++/Objective C/Objective-C/ObjectiveC/objectiveC
My faulty regex: (?![O|o]bjective[ |-]?)C(?!\+\+)
Doh.

Try this:
(?<![Oo]bjective[ -]?)C(?!\+\+)
Corrections are:
Use negative lookbehind instead of negative lookahead (the (?<!...) bit).
Removed pipe character from character classes (the [...] bits).
It might also be worth adding a \b on either side of the C, since your current regex will also match the C in Coconut, BBC, CFML and so on.
Also worth pointing out that, inside character classes, the - is special if not the first or last character. Some people prefer to escape it even in these situations, i.e. [ \-], in case a later character is accidentally added after it.
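The corrected pattern can be checked against the sample string. The sketch below assumes Python, since the question does not name a language. Python's re module only accepts fixed-width lookbehinds, so the optional [ -]? is split into two separate lookbehinds here, and the \b suggestion is included.

```python
import re

# Python's re needs fixed-width lookbehinds, so "[ -]?" becomes two assertions:
# one for "Objective" directly before the C, one for "Objective " / "Objective-".
pattern = re.compile(r"(?<![Oo]bjective)(?<![Oo]bjective[ -])\bC\b(?!\+\+)")

sample = "C/C++/Objective C/Objective-C/ObjectiveC/objectiveC"
starts = [m.start() for m in pattern.finditer(sample)]
print(starts)  # [0]
```

In an engine with variable-length lookbehind (such as .NET) the single lookbehind from the answer can be used as written.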

Related

The most efficient lookahead substitute for jflex

I am writing tokenizer in jflex. I need to match words like interferon-a as one token, and words like interferon-alpha as three.
Obvious solution would be lookaheads, but they do not work in jflex. For a similar task, I wrote a function matching one additional wildcard character after the matched pattern, checking if it is a whitespace in java code and pushing it back with or without a part of the matched string.
REGEX = [:letter:]+\-[:letter:]\.
From string interferon-alpha it would match interferon-al.
Then, in Java code section it would check if the last character of the match is a whitespace. It is not, so -al would be pushed back and interferon returned.
In the case of interferon-a, whitespace would be pushed back and interferon returned.
However, this function does not work if matched string does not have anything succeeding. Also, it seems quite clunky. Hence, I was wondering if there is any 'nicer' way of ensuring that the following character is a whitespace without actually matching and returning it.
JFlex certainly has a lookahead facility, the same as (f)lex. Unlike Java regex lookahead assertions, the JFlex lookahead can only be applied at the end of a match, but it is otherwise similar. It is described in the Semantics section of the JFlex manual:
In a lexical rule, a regular expression r may be followed by a look-ahead expression. A look-ahead expression is either $ (the end of line operator) or / followed by an arbitrary regular expression. In both cases the look-ahead is not consumed and not included in the matched text region, but it is considered while determining which rule has the longest match…
So you could certainly write the rule:
[:letter:]+\-[:letter:]/\s
However, you cannot put such a rule in a macro definition (REGEX = …), as the manual also mentions (in the section on macros):
The regular expression on the right hand side must be well formed and must not contain the ^, / or $ operators.
So the lookahead operator can only be used in a pattern rule.
Note that \s matches any whitespace character, including newline characters, while . does not match any newline character. I think that's what led to your comment that REGEX = [:letter:]+\-[:letter:]\. "does not work if matched string does not have anything succeeding" (I'm guessing that you meant "does not have anything succeeding it on the same line", and also that you intended to write . rather than \.).
Rather than testing for following whitespace, you might (depending on your language) prefer to test for a non-word character:
[:letter:]+\-[:letter:]/\W
or to craft a more precise specification as a set of Unicode properties, as in the definition of \W (also found in the linked section of the JFlex manual).
Having said all that, I'd like to repeat the advice from my previous answer to a similar question of yours: put more specific patterns first. For example, using the following pair of patterns will guarantee that the first one picks up words with a single-letter suffix, while avoiding the need to push back explicitly.
[:letter:]+(-[:letter:])? { /* matches 'interferon' or 'interferon-a' */ }
[:letter:]+/-[:letter:]+ { /* matches only 'interferon' from 'interferon-alpha' */ }
Of course, in this case you could easily avoid the collision between the second pattern and the first pattern by using {2,} instead of + for the second repetition, but it's perfectly OK to rely on pattern ordering since it's often inconvenient to guarantee that patterns don't overlap.
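For readers who want to experiment outside JFlex, the same idea can be approximated in Python. This is only an analogue and not JFlex code: Python alternation picks the first alternative that matches rather than the longest match, [A-Za-z] is an assumed stand-in for [:letter:], and the TOKEN and tokenize names are made up for the illustration.

```python
import re

# Alternatives are tried in order, so the more specific pattern comes first.
# (?![A-Za-z]) plays the role of the trailing-context check: the single-letter
# suffix is accepted only when no further letter follows it.
TOKEN = re.compile(r"[A-Za-z]+-[A-Za-z](?![A-Za-z])|[A-Za-z]+|-")

def tokenize(text):
    # findall skips anything that is not a token (e.g. whitespace).
    return TOKEN.findall(text)

print(tokenize("interferon-a"))      # ['interferon-a']
print(tokenize("interferon-alpha"))  # ['interferon', '-', 'alpha']
```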

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
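The difference is easy to reproduce. The question used Perl; the sketch below assumes Python instead, and the found helper is just a name made up for the demonstration.

```python
import re

def found(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(found(r"q[^u]", "Iraq"))          # None: there is no character after the q
print(repr(found(r"q[^u]", "Iraq\n")))  # 'q\n': the trailing newline is consumed
print(found(r"q(?!u)", "Iraq"))         # q
print(found(r"q(?:[^u]|$)", "Iraq"))    # q
print(found(r"q(?!u)", "quit"))         # None
```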
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start by looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double quotes to delimit our substring of interest, but a multi-character delimiter? For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
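A small check of the technique, assuming Python and a made-up sample string, with the negated-class attempt alongside for comparison:

```python
import re

text = "x ---some-te--xt--- y"

# Consume one character at a time, but only while "---" does not start here.
tempered = re.compile(r"---(?:(?!---).)*---")
# The negated class forbids every "-", including the single and double ones.
naive = re.compile(r"---[^-]*---")

print(tempered.search(text).group())  # ---some-te--xt---
print(naive.search(text))             # None
```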
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
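The backreference version can be tried the same way. This sketch assumes Python and an invented input containing one string of each kind:

```python
import re

# \1 is whichever quote opened the string; (?!\1). consumes any other character.
quoted = re.compile(r"""(['"])(?:(?!\1).)*\1""")

text = """say "it's" and 'a "b" c'"""
matches = [m.group() for m in quoted.finditer(text)]
for s in matches:
    print(s)
# "it's"
# 'a "b" c'
```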
Overlapping matches
This is a (completely different) problem that usually cannot be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
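Both approaches can be compared on the sample line. The sketch assumes Python, whose re module rejects (?<=^|,) because the lookbehind alternatives have different widths, so only the double-negation form is used for the lookaround version.

```python
import re

line = "aaa,111,bbb,222,333,ccc"

# The capturing version consumes the delimiters, so the comma after 222 is
# already used up when the engine reaches 333.
capturing = re.findall(r"(?:^|,)(\d+)(?:,|$)", line)

# The lookarounds only inspect the delimiters, so neighbouring fields can share them.
lookaround = re.findall(r"(?<![^,])\d+(?![^,])", line)

print(capturing)   # ['111', '222']
print(lookaround)  # ['111', '222', '333']
```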
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique in case you ever have independent criteria for a string to match.
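Both variants might look as follows in Python (an assumed language). The single regex is the one from above with the length check appended. The CHECKS table, the problems function and the message texts are hypothetical names invented for this sketch.

```python
import re

# All conditions in one regex: each lookahead starts again from position 0.
single = re.compile(
    r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)(?=.{8})"
)

# One regex per condition, so each failure gets its own message.
CHECKS = [
    ("needs a digit", re.compile(r"\d")),
    ("needs a lower case letter", re.compile(r"[a-z]")),
    ("needs an upper case letter", re.compile(r"[A-Z]")),
    ("needs a symbol", re.compile(r"[^0-9a-zA-Z]")),
]

def problems(password):
    """Return a list of human-readable reasons why the password is rejected."""
    found = [msg for msg, rx in CHECKS if not rx.search(password)]
    if re.search(r"\s", password):
        found.append("must not contain whitespace")
    if len(password) < 8:
        found.append("must be at least 8 characters")
    return found

print(bool(single.match("Passw0rd!")))  # True
print(problems("password"))
# ['needs a digit', 'needs an upper case letter', 'needs a symbol']
```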
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Another point, in addition to what others have mentioned: with a negative lookahead you can negate a sequence of consecutive characters (e.g. ui). With [^...] you cannot negate ui, only u or i individually, and if you try [^ui]{2}, you will also exclude uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.
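A two-line check of that difference, assuming Python and the made-up input qat:

```python
import re

# The lookahead leaves the "a" unconsumed, so the group captures it.
with_lookahead = re.search(r"q(?!u)([a-z])", "qat").group(1)
# The negated class consumes the "a", so the group captures the next letter.
with_class = re.search(r"q[^u]([a-z])", "qat").group(1)

print(with_lookahead)  # a
print(with_class)      # t
```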

What does (^?)* mean in this regex?

I have this regex:
^(^?)*\?(.*)$
If I understand correctly, this is the breakdown of what it does:
^ - start matching from the beginning of the string
(^?)* - I don't know, but it stores it in $1
\? - matches a question mark
(.*)$ - matches anything until the end of the string
So what does (^?)* mean?
The (^?) is simply looking for the literal character ^. The ^ character in a regex pattern only has special meaning when used as the first character of the pattern or the first character in a character class []. When used outside those two positions, the ^ is interpreted literally, meaning it looks for the ^ character in the input string.
Note: Whether or not ^ outside of those two positions is interpreted literally is regex engine specific. I'm not familiar enough with Lua to state which it does.
Lua does not have a conventional regexp language, it has Lua patterns in its place. While they look a lot like regexp, Lua patterns are a distinct language of their own that has a simpler set of rules and most importantly lacks grouping and alternation features.
Interpreted as a Lua pattern, the example will surprise a longtime regexp user since so many details are different.
Lua patterns are described in PiL, and at first glance are similar enough to a conventional regexp to cause confusion. The biggest differences are probably the lack of an alternation operator |, that parentheses are only used to mark captures, that quantifiers (?, -, +, and *) only apply to a character or character class, and that % is the escape character, not \. A big clue that this example was probably not written with Lua in mind is the lack of the Lua pattern quoting character % applied to any (or ideally, all) of the non-alphanumeric characters in the pattern string, and the suspicious use of \? which smells like a conventional regexp to match a single literal ?.
The simple answer to the question asked is: (^?)* is not a recommended form, and would match ^* or *, capturing the presence or absence of the caret. If that were the intended effect, then I would write it as (%^?)%* to make that clearer.
To see why this is the case, let's take the pattern given and analyze it as a Lua pattern. The entire pattern is:
^(^?)*\?(.*)$
Handed to string.match(), it would be interpreted as follows:
^ anchors the match to the beginning of the string.
( marks the beginning of the first capture.
^ is not at the beginning of the pattern or a character class, so it matches a literal ^ character. For clarity that should likely have been written as %^.
? matches exactly zero or one of the previous character.
) marks the end of the first capture.
* is not after something that can be quantified so it matches a literal * character. For clarity that should likely have been written as %*.
\ in a pattern matches itself; it is not an escape character in the pattern language. However, it is an escape character in a Lua short string literal, making the following character not special to the string literal parser, which in this case is moot because the ? that follows was not special to it in any case. So if the pattern were enclosed in double or single quotes, then the \ would be absorbed by string parsing. If written in a long string (as [[^(^?)*\?(.*)$]]), the backslash would survive the string parser, to appear in the pattern.
? matches exactly zero or one of the previous character.
( marks the beginning of the second capture.
. matches any character at all, effectively a synonym for the class [\000-\255] (remember, in Lua numeric escapes are in decimal not octal as in C).
* matches zero or more of the previous character, greedily.
) marks the end of the second capture.
$ anchors the pattern to the end of the string.
So it matches and captures an optional ^ at the beginning of the string, followed by *, then an optional \ which is not captured, and captures the entire rest of the string. string.match would return two strings on success (either or both of which might be zero length), or nil on failure.
Edit: I've fixed some typos, and corrected an error in my answer, noticed by Egor in a comment. I forgot that in patterns, special symbols lose their specialness when in a spot where it can't apply. That makes the first asterisk match a literal asterisk rather than be an error. The cascade of that falls through most of the answer.
Note that if you really want a true regexp in Lua, there are libraries available that will provide it. That said, the built-in pattern language is quite powerful. If it is not sufficient, then you might be best off adopting a full parser, and use LPeg which can do everything a regexp can and more. It even comes with a module that provides a complete regexp syntax that is translated into an LPeg grammar for execution.
In this case, the (^?) refers to the previous string "^" meaning the literal character ^ as Jared has said. Check out regexlib for any further deciphering.
For all your Regex needs: http://regexlib.com/CheatSheet.aspx
It looks to me like the intent of the creator of the expression was to match any number of ^ before the question mark, but only wanted to capture the first instance of ^. However, it may not be a valid expression depending on the engine, as others have stated.

A pattern matching an expression that doesn't end with specific sequence

I need a regex pattern which matches such strings that DO NOT end with such a sequence:
\.[A-z0-9]{2,}
by which I mean the examined string must not have at its end a sequence of a dot and then two or more alphanumeric characters.
For example, a string
/home/patryk/www
and also
/home/patryk/www/
should match desired pattern and
/home/patryk/images/DSC002.jpg should not.
I suppose this has something to do with lookarounds (look aheads) but still I have no idea how to make it.
Any help appreciated.
Old Answer
You can use a negative lookbehind at the end if your regex flavor supports it:
^.*+(?<!\.\w{2,})$
This will match a string that has an end anchor not preceded by the icky sequence you don't want.
Note that, as m.buettner has pointed out, this uses an indefinite-length lookbehind, which is a feature unique to .NET.
New Answer
After a bit of digging around, however, I've found that variable length look-aheads are pretty widely supported, so here is a version that uses those:
^(?:(?!\.\w{2,}$).)++$
In a comment on an answer, you have stated you wanted to not match strings with forward slashes at the end, which is accomplished by simply adding a forward slash to the lookahead.
^(?:(?!(\.\w{2,}|/)$).)++$
Note that I am using \w for succinctness, but it lets underscores through. If this is important, you could replace it with [^\W_].
Asad's version is very convenient, but only .NET's regex engine supports variable-length lookbehinds (which is one of the many reasons why every regex question should include the language or tool used).
We can reduce this to a fixed-length lookbehind (which is supported in most engines except for JavaScript) if we think about the possible cases which should match. That would be either one or zero letters/digits at the end (whether preceded by . or not) or two or more letters/digits that are not preceded by a dot.
^.*(?:(?<![a-zA-Z0-9])[a-zA-Z0-9]?|(?<![a-zA-Z0-9.])[a-zA-Z0-9]{2,})$
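As a quick check, here is the fixed-length lookbehind version run against the three paths from the question. Python is an assumption (the question does not name a tool), and no_ext is just a name chosen for the sketch.

```python
import re

# Either at most one trailing letter/digit, or a run of two or more
# letters/digits that is not preceded by a dot.
no_ext = re.compile(
    r"^.*(?:(?<![a-zA-Z0-9])[a-zA-Z0-9]?"
    r"|(?<![a-zA-Z0-9.])[a-zA-Z0-9]{2,})$"
)

paths = [
    "/home/patryk/www",
    "/home/patryk/www/",
    "/home/patryk/images/DSC002.jpg",
]
results = [bool(no_ext.match(p)) for p in paths]
print(results)  # [True, True, False]
```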
This should do it:
^(?:[^.]+|\.(?![A-Za-z0-9]{2,}$))+$
It alternates between matching one or more of anything except a dot, or a dot if it's not followed by two or more alphanumeric characters and the end of the string.
EDIT: Upgrading it to meet the new requirement is just more of the same:
^(?:[^./]+|/(?=.)|\.(?![A-Za-z0-9]{2,}$))+$
Breaking that down, we have:
[^./]+ # one or more of any characters except . or /
/(?=.) # a slash, as long as there's at least one character following it
\.(?![A-Za-z0-9]{2,}$) # a dot, unless it's followed by two or more alphanumeric characters followed by the end of the string
On another note: [A-z] is an error. It matches all the uppercase and lowercase ASCII letters, but it also matches the characters [, ], ^, _, backslash and backtick, whose code points happen to lie between Z and a.
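The extra characters are easy to see by running the class over the ASCII range from Z to a (Python assumed; the test string is constructed for the demonstration):

```python
import re

chars = "Z[\\]^_`a"  # Z, the six characters between Z and a, then a

wide = re.findall(r"[A-z]", chars)
strict = re.findall(r"[A-Za-z]", chars)

print(len(wide))  # 8: every character in the string matches [A-z]
print(strict)     # ['Z', 'a']
```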
Variable length look behinds are rarely supported, but you don't need one:
^.*(?<!\.[A-z0-9][A-z0-9]?)$

Regexp Question - Negating a captured character

I'm looking for a regular expression that allows for either single-quoted or double-quoted strings, and allows the opposite quote character within the string. For example, the following would both be legal strings:
"hello 'there' world"
'hello "there" world'
The regexp I'm using uses negative lookahead and is as follows:
(['"])(?:(?!\1).)*\1
This would work I think, but what about if the language didn't support negative lookahead. Is there any other way to do this? Without alternation?
EDIT:
I know I can use alternation. This was more of just a hypothetical question. Say I had 20 different characters in the initial character class. I wouldn't want to write out 20 different alternations. I'm trying to actually negate the captured character, without using lookahead, lookbehind, or alternation.
This is actually much simpler than you may have realized. You don't really need the negative look-ahead. What you want to do is a non-greedy (or lazy) match like this:
(['"]).*?\1
The ? character after the .* is the important part. It says, consume the minimum possible characters before hitting the next part of the regex. So, you get either kind of quote, and then you go after 0-M characters until you encounter a character matching whichever quote you first ran into.
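Checked against the two strings from the question (Python assumed, as no language is named):

```python
import re

# Lazy match: extend one character at a time until the opening quote recurs.
lazy = re.compile(r"""(['"]).*?\1""")

samples = ['"hello \'there\' world"', "'hello \"there\" world'"]
whole = [lazy.search(s).group() == s for s in samples]
print(whole)  # [True, True]
```

Note that this still relies on a backreference, which some engines without lookahead may also lack.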
Sure:
'([^']*)'|"([^"]*)"
On a successful match, the $+ variable will hold the contents of whichever alternate matched.
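$+ is a Perl variable. In Python (an assumed alternative, not what this answer used), a rough analogue is the match object's lastindex attribute, which gives the number of the last capturing group that matched. The either and contents names are invented for the sketch.

```python
import re

# Same alternation as above, split into two literals to keep the quoting readable.
either = re.compile(r"'([^']*)'" r'|"([^"]*)"')

def contents(text):
    """Return the text inside the first quoted string, or None."""
    m = either.search(text)
    # lastindex is 1 or 2 depending on which alternative matched.
    return m.group(m.lastindex) if m else None

print(contents('"hello \'there\' world"'))  # hello 'there' world
print(contents("'hello \"there\" world'"))  # hello "there" world
```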
In the general case, regexps are not really the answer. You might be interested in something like Text::ParseWords, which tokenizes text, accounting for nested quotes, backslashed quotes, backslashed spaces, and other oddities.