Negative lookahead preceded by .* - regex

I want to select all text within {}, but only if there is no \status…{} in there.
Examples that should match:
\subsection{Hello} -> "\subsection", "Hello"
\section{Foobar} -> "\section", "Foobar"
\subsubsection{This is a Triumph} -> "\subsubsection", "This is a Triumph"
Examples that should not match:
\subsection{Hello\statusdone{}}
\section{Hello World\statuswip{}}
\section{Everything\statusproofreading{}}
I thought negative lookaheads would be perfect for this:
(\\.*section)\{(.*)(?!\\status.*)\}
but they match:
\subsection{Hello\statusdone{}} -> "\subsection", "Hello\statusdone{}"
\section{Hello World\statuswip{}} -> "\section", "Hello World\statuswip{}"
\section{Everything\statusproofreading{}} -> "\section", "Everything\statusproofreading{}"
I suspect it is because of the .* preceding the negative lookahead. If I replace it with, e.g., Hello in the following regex:
(\\.*section)\{(Hello)(?!\\status.*)\}
It correctly does not match the first negative example \subsection{Hello\statusdone{}}.
How do I work around that?

You should move the negative lookahead earlier in the pattern, so that it checks for the presence of that substring before the entire string (.*) is consumed.
You can use:
\\.*section\{((?!.*\\status.*\{\})[^}]+)}
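As a rough check of the difference, here is a minimal Python sketch. Python's re module is an assumption, since the question does not name an engine. It runs the original pattern and the fixed one over the question's examples. The only change to the answer's pattern is an added capture group around the command name, and the names naive, fixed and samples are made up for illustration.

```python
import re

# Original attempt: the greedy (.*) runs to the last "}", so the lookahead is
# only tested at a position where "\status" can no longer follow.
naive = re.compile(r"(\\.*section)\{(.*)(?!\\status.*)\}")

# Answer's fix: test the lookahead right after "{", before anything is consumed.
fixed = re.compile(r"(\\.*section)\{((?!.*\\status.*\{\})[^}]+)\}")

samples = [
    r"\subsection{Hello}",
    r"\section{Foobar}",
    r"\subsection{Hello\statusdone{}}",
    r"\section{Hello World\statuswip{}}",
]

for line in samples:
    n = naive.search(line)
    f = fixed.search(line)
    print(line, "| naive:", n.groups() if n else None, "| fixed:", f.groups() if f else None)
```

The naive pattern still matches the \status lines, while the fixed one returns no match for them.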

Regex doesn't have a "needle not inside haystack" tester. (Or at least no common implementation has one.)
You're confusing the way zero-width assertions work. It's an ANY match, not an ALL match: the assertion is tested only at the position the engine has reached, and as soon as one way of matching the whole pattern succeeds, the engine returns it.
You have a two-pass job ahead of you. The first problem is that LaTeX (or whatever this is) is not a regular language, and that means regular expressions aren't going to work well for arbitrary text.
Consider \section{\math{\ref{\status{asfd}}}} and which final "}" your pattern ends up matching.
You need a parser to do that right, not regex. Sorry.
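A full LaTeX parser is out of scope here, but the two-pass idea can be sketched with a small brace counter in Python. This is only an illustration under loud assumptions: braced_argument is a hypothetical helper, and it ignores escaped braces such as \{ as well as comments and verbatim text.

```python
def braced_argument(text, open_index):
    """Return the text inside the balanced {...} group that starts at open_index.

    Hypothetical helper: does not handle escaped braces or LaTeX comments.
    """
    if text[open_index] != "{":
        raise ValueError("expected '{' at open_index")
    depth = 0
    for i in range(open_index, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Found the brace that closes the group opened at open_index.
                return text[open_index + 1:i]
    raise ValueError("unbalanced braces")


line = r"\section{\math{\ref{\status{asfd}}}}"
# Pass 1: pull out the whole argument, however deeply it nests.
argument = braced_argument(line, line.index("{"))
# Pass 2: a plain substring test on the extracted argument.
has_status = r"\status" in argument
print(argument, has_status)
```

The first pass finds the matching closing brace by counting depth, which a regular expression cannot do for arbitrary nesting; the second pass is then a trivial containment test.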

Related

How do you match a pattern skipping exceptions?

In vim, I'd like to match a regular expression in a search and replace operation, but with exceptions — a list of matches that I want to skip.
For example, suppose I have the text:
-one- lorem ipsum -two- blah blah -three- now is the time -four- the quick brown -five- etc. etc.
(but with lots of other possibilities) and I want to match -\(\w\+\)- and replace it with *\1* but skipping over (not matching) -two- and -four-, so the result would be:
*one* lorem ipsum -two- blah blah *three* now is the time -four- the quick brown *five* etc. etc.
It seems like I should be able to use some kind of assertion (lookbehind, lookahead, something) for this, but I'm coming up blank.
You're looking for a negative lookahead assertion. In Vim, that's done via \@! (see :help /\@!), like (?!pattern) in Perl.
Basically, you say "don't match FOO here", and in general match word characters:
/-\(\%(FOO\)\@!\w\+\)-/
Note how I'm using non-capturing groups (:help /\%(). What's still missing is an assertion on the end, so the above would also exclude -FOOBAR-. As we have a unique end delimiter here, it's easiest to append that:
/-\(\%(FOO-\)\@!\w\+\)-/
Applied to your example, you just need to introduce two branches (for the two exclusions) in place of FOO, and you're done:
/-\(\%(two-\|four-\)\@!\w\+\)-/
Or, by factoring out the duplicated end delimiter:
/-\(\%(\%(two\|four\)-\)\@!\w\+\)-/
This matches any word characters in between dashes, except if those words form either two or four.
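For readers more at home with Perl-style syntax, roughly the same pattern can be written with (?!...) in Python. This is a sketch outside Vim, so the re module and the variable names are assumptions; the sample text is the one from the question.

```python
import re

text = ("-one- lorem ipsum -two- blah blah -three- now is the time "
        "-four- the quick brown -five- etc. etc.")

# After the opening dash, refuse to continue if "two-" or "four-" follows.
pattern = re.compile(r"-(?!(?:two|four)-)(\w+)-")

result = pattern.sub(r"*\1*", text)
print(result)
```

Including the trailing dash inside the lookahead plays the same role as in the Vim version: a word such as -twofold- is not excluded.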
The negative lookahead in my other answer is the direct solution, but its syntax is a bit complex (and there can be patterns where the rules for the delimiter are not so simple, and the result then is much less readable).
As you're using substitution, an alternative is to put the selection logic into the replacement part of :substitute. Vim allows a Vimscript expression in there via :help sub-replace-expression.
We have the captured word in submatch(1) (equivalent to \1 in a normal replacement), and now just need to check for the two excluded words; if it's one of those, do a no-op substitution by returning the original full match (submatch(0)), else just return the captured group.
:substitute/-\(\w\+\)-/\=submatch(index(['two', 'four'], submatch(1)) == -1 ? 1 : 0)/g
It's not shorter than the lookahead pattern (well, we could golf the pattern and drop the ternary operator, as a boolean is represented by 0/1, anyway), so here I would still use the pattern. But in general, it's good to know that there's more than one way to do it :-)
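The "decide in the replacement" idea is not specific to Vim. As a sketch in Python (an assumption, the answer itself only covers Vimscript), re.sub accepts a function as the replacement; EXCLUDED and star_unless_excluded are hypothetical names. Unlike the Vim one-liner above, this version also adds the asterisks the question asked for.

```python
import re

EXCLUDED = {"two", "four"}  # words to leave untouched


def star_unless_excluded(match):
    """Return the match unchanged for excluded words, else wrap the word in asterisks."""
    word = match.group(1)
    if word in EXCLUDED:
        return match.group(0)  # no-op substitution
    return f"*{word}*"


text = ("-one- lorem ipsum -two- blah blah -three- now is the time "
        "-four- the quick brown -five- etc. etc.")
result = re.sub(r"-(\w+)-", star_unless_excluded, text)
print(result)
```

The pattern stays trivial and the exclusion list becomes ordinary data, which tends to scale better than a growing alternation inside a lookahead.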

Regex: Detect the presence of overlapping or non-overlapping repeating patterns of a given length

Is this possible to do with regex?
For example, in: "tagaga", I'd like to match "aga" because it occurs more than once.
'(.{3})(.*)\1'
finds non-overlapping matches (matches "agacaga" in "tagacaga") but not overlapping matches.
However, using look-ahead in this way does not work for me:
'(.{3})(.*)(?=\1)'
Alternatively, if the regex solution doesn't exist, is there a dynamic programming solution for this?
Ultimately, I only care about presence and do not need the matched string. I am working in MATLAB, if it makes any difference.
How about this:
Test string:
tagaga
Regex:
(?=(aga)).{2}(?<=.)(\1)
Matches:
"aga", "aga"
Working regex example:
http://regex101.com/r/uT5fS1
However, the quantifier depends on the length of the match: in your example, aga has length 3, so you would have to set the quantifier to the length minus 1 (in this case {2}). So if your match was abca, you would have to change the quantifier to {3}.
So with test example:
abcabca
Regex:
(?=(abca)).{3}(?<=.)(\1)
Matches:
"abca", "abca"
Move lookahead part to the middle:
(.+?)(?=(.+?))\1\2
Example in Javascript:
/(.+?)(?=(.+?))\1\2/.test('asdf') // false
/(.+?)(?=(.+?))\1\2/.test('tagaga') // true
/(.+?)(?=(.+?))\1\2/.test('tagacaga') // false
/(.+?)(?=(.+?))\1\2/.test('agag') // false
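Since the question only cares about presence for a given length and also asks about a non-regex route, here is a minimal Python sketch of both (the question mentions MATLAB, so the language is an assumption, and has_repeat / has_repeat_regex are made-up names). The first function remembers every window of length n in a set; the second captures a window inside a lookahead so that the repeat is allowed to overlap it.

```python
import re


def has_repeat(s, n):
    """True if some substring of length n occurs at two different positions."""
    seen = set()
    for i in range(len(s) - n + 1):
        chunk = s[i:i + n]
        if chunk in seen:
            return True
        seen.add(chunk)
    return False


def has_repeat_regex(s, n):
    """Same test with a regex: capture n chars in a lookahead, then find them again later."""
    # The lookahead consumes nothing, so the later copy may overlap the first one.
    pattern = r"(?=(.{%d})).+\1" % n
    return re.search(pattern, s, re.DOTALL) is not None


for sample in ["tagaga", "tagacaga", "abcdef"]:
    print(sample, has_repeat(sample, 3), has_repeat_regex(sample, 3))
```

The set-based version does a single pass over the string, while the regex version may backtrack heavily on long inputs.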

Using regex to find a pattern which does not start with a certain String

I need to regex-match numbers in a certain pattern which works already, but only if there is not (+ right in front of it.
Example Strings I want to have a valid match within: 12, 12.5, 200/300, 200/300/400%, 1/2/3/4/5/6/7
Example Strings I want to have no valid match within: (+10% juice), (+4)
I can already get all the valid matches with (\d+[/%.]?)+, but I need help to exclude the example Strings I want to have no valid match in (which means, only match if there is NOT the String (+ right in front of the mentioned pattern).
Can someone help me out? I have already experimented with the ! (like ?!(\(\+)(\d+[/%.]?)+) but for some reason I can't get it to work the way I need it.
(You can use http://gskinner.com/RegExr/ to test regex live)
EDIT: I may have used the wrong words. I don't want to check whether the search string starts with (+; I want to make sure that there is no (+ right in front of my match.
Try regexr with the following inputs:
Match: (\d+[/%.]?)+
Check the checkbox for global (to search for more than one match within the text)
Text:
this should find a match: 200/300/400
this shouldnt find any match at all: (+100%)
this should find a match: 40/50/60%
this should find a match: 175
Currently it will find a match in all 4 lines. I want a regex that does no longer find a match in line 2.
The regex construct you want is "negative lookbehind"; see http://www.regular-expressions.info/lookaround.html. A negative lookbehind is written (?<!DONTMATCHME), where DONTMATCHME is the expression you don't want to find just before the next bit of the expression. Crucially, the lookbehind bit is not considered part of the match itself.
Your expression should be:
(?<![+\d/\.])(\d+[/%.]?)+
Edit - changed negative lookbehind to any character that is not a + or another digit!
Edit #2 - moved the lookbehind outside the main capture brackets. Expanded the range of not acceptable characters before the match to include / & .
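To see why the lookbehind has to exclude digits, / and . as well as +, here is a small Python check (Python is an assumption; the question tests in RegExr). With only (?<!\+), the engine would skip the 1 in (+100%) and happily match 00% from the next position.

```python
import re

# Lookbehind rejects a start position preceded by "+", a digit, "/" or ".".
NOT_AFTER_PLUS = re.compile(r"(?<![+\d/.])(\d+[/%.]?)+")

lines = [
    "this should find a match: 200/300/400",
    "this shouldnt find any match at all: (+100%)",
    "this should find a match: 40/50/60%",
    "this should find a match: 175",
]

for line in lines:
    m = NOT_AFTER_PLUS.search(line)
    print(repr(line), "->", m.group(0) if m else None)
```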

Regular Expression to match fractions and not dates

I'm trying to come up with a regular expression that will match a fraction (1/2) but not a date (5/5/2005) within a string. Any help at all would be great, all I've been able to come up with is (\d+)/(\d+) which finds matches in both strings. Thanks in advance for the help.
Assuming PCRE, use negative lookahead and lookbehind:
(?<![\/\d])(\d+)\/(\d+)(?![\/\d])
A lookahead (a (?=) group) says "match this stuff if it's followed by this other stuff." The contents of the lookahead aren't part of the match. We negate it (the (?!) group) so that the match fails when that stuff comes after our fraction; that way, we don't match a group that continues into what follows.
The complement to a lookahead is a lookbehind (a (?<=) group), which does the opposite: it matches stuff if it's preceded by this other stuff. Just like the lookahead, we can negate it (the (?<!) group) so that we can match things that don't follow something.
Together, they ensure that our fraction doesn't have other parts of fractions before or after it. It places no other arbitrary requirements on the input data. It will match the fraction 2/3 in the string "te2/3xt", unlike most of the other examples provided.
If your regex flavor uses //s to delimit regular expressions, you'll have to escape the slashes in that, or use a different delimiter (Perl's m{} would be a good choice here).
Edit: Apparently, none of these regexes work because the regex engine is backtracking and matching fewer numbers in order to satisfy the requirements of the regex. When I've been working on one regex for this long, I sit back and decide that maybe one giant regex is not the answer, and I write a function that uses a regex and a few other tools to do it for me. You've said you're using Ruby. This works for me:
>> def get_fraction(s)
>> if s =~ /(\d+)\/(\d+)(\/\d+)?/
>> if $3 == nil
>> return $1, $2
>> end
>> end
>> return nil
>> end
=> nil
>> get_fraction("1/2")
=> ["1", "2"]
>> get_fraction("1/2/3")
=> nil
This function returns the two parts of the fraction, but returns nil if it's a date (or if there's no fraction). It fails for "1/2/3 and 4/5" but I don't know if you want (or need) that to pass. In any case, I recommend that, in the future, when you ask on Stack Overflow, "How do I make a regex to match this?" you should step back first and see if you can do it using a regex and a little extra. Regular expressions are a great tool and can do a lot, but they don't always need to be used alone.
EDIT 2:
I figured out how to solve the problem without resorting to non-regex code, and updated the regex. It should work as expected now, though I haven't tested it. I also went ahead and escaped the /s since you're going to have to do it anyway.
EDIT 3:
I just fixed the bug j_random_hacker pointed out in my lookahead and lookbehind. I continue to see the amount of effort being put into this regex as proof that a pure regex solution was not necessarily the optimal solution to this problem.
Use negative lookahead and lookbehind.
/(?<![\/\d])(?:\d+)\/(?:\d+)(?![\/\d])/
EDIT: I've fixed my answer to trap for the backtracking bug identified by @j_random_hacker. As proof, I offer the following quick and dirty PHP script:
<?php
$subject = "The match should include 1/2 but not 12/34/56 but 11/23, now that's ok.";
$matches = array();
preg_match_all('/(?<![\/\d])(?:\d+)\/(?:\d+)(?![\/\d])/', $subject, $matches);
var_dump($matches);
?>
which outputs:
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(3) "1/2"
    [1]=>
    string(5) "11/23"
  }
}
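The backtracking bug mentioned in these answers is easy to reproduce. The following Python sketch (Python is an assumption; the answers use Ruby, PHP and Perl) compares a lookaround that only excludes slashes with the version that also excludes digits. NAIVE and FIXED are illustrative names.

```python
import re

# Only slashes excluded: the engine can give back a digit to satisfy (?!/).
NAIVE = re.compile(r"(?<!/)\d+/\d+(?!/)")

# Slashes and digits excluded on both sides, as in the corrected answers.
FIXED = re.compile(r"(?<![/\d])(\d+)/(\d+)(?![/\d])")

subject = "The match should include 1/2 but not 12/34/56 but 11/23, now that's ok."

print(NAIVE.search("12/34/56"))   # matches a fragment of the date
print(FIXED.search("12/34/56"))   # no match
print(FIXED.findall(subject))     # numerator/denominator pairs
```

In the naive pattern, \d+ first takes 34, the lookahead sees the following slash and fails, and the engine backtracks to 3, which is followed by 4 rather than a slash. Excluding digits in the lookarounds closes that escape route.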
Lookahead is great if you're using Perl or PCRE, but if they are unavailable in the regex engine you're using, you can use:
(^|[^/\d])(\d+)/(\d+)($|[^/\d])
The 2nd and 3rd captured segments will be the numerator and denominator.
If you do use the above in a Perl regex, remember to escape the /s -- or use a different delimiter, e.g.:
m!(?:^|[^/\d])(\d+)/(\d+)(?:$|[^/\d])!
In this case, you can use (?:...) to avoid saving the uninteresting parenthesised parts.
EDIT 18/12/2009: Chris Lutz noticed a tricky bug caused by backtracking that plagues most of these answers -- I believe this is now fixed in mine.
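As a quick sanity check of the lookaround-free variant, here is a Python sketch (Python's re does support lookaround, so this only illustrates how the pattern behaves; find_fraction is a hypothetical helper name).

```python
import re

# Boundary characters are consumed by groups 1 and 4 instead of being asserted.
PORTABLE = re.compile(r"(^|[^/\d])(\d+)/(\d+)($|[^/\d])")


def find_fraction(s):
    """Return (numerator, denominator) of the first fraction in s, or None."""
    m = PORTABLE.search(s)
    if m is None:
        return None
    return m.group(2), m.group(3)


for sample in ["1/2", "5/5/2005", "12/34/56", "te2/3xt"]:
    print(sample, "->", find_fraction(sample))
```

One trade-off of consuming the boundary characters: when two fractions are separated by a single character, as in "1/2 3/4", that character is used up by the first match, so a global search finds only the first fraction.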
If it's line input, you can try
^(\d+)\/(\d+)$
Otherwise, perhaps use this:
^(\d+)\/(\d+)[^\\]*.
this will work: (?<![/]{1})\d+/\d+(?![/]{1})
Depending on the language you're working with, you might try negative lookahead or lookbehind assertions: in Perl, (?!pattern) asserts that /pattern/ can't follow the matched string.
Or, again, depending on the language, and anything you know about the context, a word-boundary match (\b in Perl) might be appropriate.

Regex with exception of particular words

I have a problem with a regex.
I need to make a regex with an exception for a set of specified words, for example: apple, orange, juice.
Given these words, it should match everything except the words above.
applejuice (match)
yummyjuice (match)
yummy-apple-juice (match)
orangeapplejuice (match)
orange-apple-juice (match)
apple-orange-aple (match)
juice-juice-juice (match)
orange-juice (match)
apple (should not match)
orange (should not match)
juice (should not match)
If you really want to do this with a single regular expression, you might find lookaround helpful (especially negative lookahead in this example). Regex written for Ruby (some implementations have different syntax for lookarounds):
rx = /^(?!apple$|orange$|juice$)/
I noticed that apple-juice should match according to your parameters, but what about apple juice? I'm assuming that if you are validating apple juice you still want it to fail.
So, let's build a set of characters that count as a "boundary":
/[^-a-z0-9A-Z_]/ // Will match any character that is <NOT> -, _, or
// in a-z, 0-9, A-Z
/(?:^|[^-a-z0-9A-Z_])/ // Matches the beginning of the string, or one of those
// non-word characters.
/(?:[^-a-z0-9A-Z_]|$)/ // Matches a non-word character or the end of the string
/(?:^|[^-a-z0-9A-Z_])(apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)/
// This should >match< apple/orange/juice ONLY when it is not preceded or followed
// by a 'word' character; just negate the result of the test to obtain your desired
// result.
In most regexp flavors \b counts as a "word boundary" but the standard list of "word characters" doesn't include - so you need to create a custom one. It could match with /\b(apple|orange|juice)\b/ if you weren't trying to catch - as well...
If you are only testing 'single word' tests you can go with a much simpler:
/^(apple|orange|juice)$/ // and take the negation of this...
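The "match the forbidden word, then negate" approach can be sketched in Python like this (the language and the names BOUNDARY, FORBIDDEN and is_allowed are assumptions for illustration; the patterns are the ones built above).

```python
import re

# Any character that is not "-", "_", a letter or a digit counts as a boundary.
BOUNDARY = r"[^-a-z0-9A-Z_]"

# A forbidden word standing alone between boundaries or string ends.
FORBIDDEN = re.compile(f"(?:^|{BOUNDARY})(apple|orange|juice)(?:{BOUNDARY}|$)")


def is_allowed(s):
    """Negate the test: allowed means no stand-alone forbidden word was found."""
    return FORBIDDEN.search(s) is None


for sample in ["applejuice", "apple-juice", "apple juice", "apple", "juice-juice-juice"]:
    print(sample, "->", is_allowed(sample))
```

Because the dash is treated as a word character here, apple-juice passes while apple juice is rejected, which is the behaviour assumed at the start of this answer.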
This gets some of the way there:
((?:apple|orange|juice)\S)|(\S(?:apple|orange|juice))|(\S(?:apple|orange|juice)\S)
\A(?!apple\Z|juice\Z|orange\Z).*\Z
will match an entire string unless it only consists of one of the forbidden words.
Alternatively, if you're not using Ruby or you're sure that your strings contain no line breaks or you have set the option that ^ and $ do not match on beginnings/ends of lines
^(?!apple$|juice$|orange$).*$
will also work.
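Here is the anchored negative lookahead run against the question's examples in Python (an assumption; the question does not name a language). In Python's re module, \A and \Z anchor to the start and end of the whole string.

```python
import re

# Reject the string only if it consists of exactly one forbidden word.
ALLOWED = re.compile(r"\A(?!apple\Z|juice\Z|orange\Z).*\Z")

should_match = [
    "applejuice", "yummyjuice", "yummy-apple-juice", "orangeapplejuice",
    "orange-apple-juice", "apple-orange-aple", "juice-juice-juice", "orange-juice",
]
should_not_match = ["apple", "orange", "juice"]

for s in should_match + should_not_match:
    print(s, "->", ALLOWED.match(s) is not None)
```

The lookahead is tested once at the start of the string, before .* consumes anything, so it never depends on backtracking.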
Here's some easy copy-paste code that works for more than just exact-words exceptions.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
Python regex (the atomic group (?>...) needs Python 3.11+ with the standard re module)
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = \b
YOUR_NORMAL_PATTERN = \w+
REGEX_AFTER =
EXCEPTION_PATTERN = (apple|orange|juice)
Python regex
pattern = r"\b(?>(?P<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)"
Ruby regex
pattern = /\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
PCRE regex
\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)
How does it work?
This uses some fairly advanced regex features, namely atomic groups, conditionals, lookbehinds, and named groups.
The (?> is the start of an atomic group, which means it's not allowed to backtrack: if that group matches once but then later gets invalidated because a lookbehind failed, then the whole group will fail to match. (We want this behavior in this case.)
The (?<exceptions_group_1> creates a named capture group. It's just easier than using numbers. Note that the atomic group first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because it's looking for the word "always" and then it checks 'does "ways"=="fail"', which it never will.
Because the conditional fails, the overall match fails at that point, and because the group is atomic, the engine is not allowed to backtrack into it (to try the normal pattern), since it already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question in Ruby
/\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Unlike other methods, this one can be modified to reject any pattern, such as any word containing the substring "apple", "orange", or "juice".
/\b(?>(?<exceptions_group_1>\w*(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Something like (PHP)
$input = "The orange apple gave juice";
if (preg_match("your regex for validating", $input) && !preg_match("/apple|orange|juice/", $input))
{
// it's ok;
}
else
{
//throw validation error
}