I am trying to set up a regex that will match the following:
*test*
*t*
*te*
But, I do not want it to match:
*test**
The general rules are:
Must start with the beginning of the line (^) or a whitespace character (\s)
Must have one and only one *
Can match any character
Must match one more *
Must end with end of the line ($) or a whitespace character (\s)
I have generated the following regex:
(\s|^)\*([^\*].+?[^\*])\*(\s|$)
This nearly satisfies my requirements; however, because of the two [^\*] groups within the second capturing group, it seems to require that capturing group to be 3 characters or more. *tes* matches, but *t* and *te* do not.
I have three specific questions:
Why does the character negation lead to the 3 character limit?
Is there a better way to express "any character except" than I have done here?
Any thoughts on a better regex to satisfy my requirements?
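The problem is the extra . in the capturing group:
[^\*].+?[^\*]
     ^
[^\*] matches one character other than *, and .+? then matches one or more of any character except a newline. Together with the second [^\*], the group therefore needs at least three characters.
Since the same character class appears at both ends, you can drop the . and apply the + quantifier to a single class to match one or more non-* characters. (Inside a character class the * needs no escaping, so [^*] is equivalent to [^\*].)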
(\s|^)\*([^\*]+?)\*(\s|$)
You can also use non-capturing groups so that the surrounding whitespace is not captured.
(?:\s|^)\*([^\*]+?)\*(?:\s|$)
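As a quick check of the fix, here is a minimal sketch that runs the original and the corrected pattern against the sample inputs. Python's re module is an assumption (the question does not name a language), and the helper name is_match is made up for illustration.

```python
import re

# Original pattern from the question: the inner group needs at least 3 characters.
ORIGINAL = re.compile(r"(\s|^)\*([^\*].+?[^\*])\*(\s|$)")
# Corrected pattern from the answer, with non-capturing delimiters.
FIXED = re.compile(r"(?:\s|^)\*([^\*]+?)\*(?:\s|$)")


def is_match(pattern, text):
    """Return True if the pattern matches anywhere in text (hypothetical helper)."""
    return pattern.search(text) is not None


for sample in ["*test*", "*t*", "*te*", "*test**"]:
    # Columns: sample, result with the original pattern, result with the fixed one
    print(sample, is_match(ORIGINAL, sample), is_match(FIXED, sample))
```

With the corrected pattern the first three samples match and *test** does not, while the original pattern rejects *t* and *te* as described in the question.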
Related
Using Java, I want to detect whether a line starts with words and separators followed by "myword", but this regex takes too long. What is wrong with it?
^\s*(\w+(\s|/|&|-)*)*myword
The pattern ^\s*(\w+(\s|/|&|-)*)*myword is inefficient because of the nested quantifiers. \w+ requires at least one word character, and (\s|/|&|-)* can match zero or more separator characters. When the outer * is applied to the group and the input has no separators between word characters, the expression behaves like (\w+)*, a classic catastrophic backtracking pattern.
Just a small illustration of \w+ and (\w+)* performance:
[Images comparing \w+ and (\w+)*]
Your pattern is even more complicated and involves even more backtracking steps. To avoid such issues, a pattern should not have optional subpatterns inside quantified groups. That is, create a group with obligatory subpatterns and apply the necessary quantifier to the group.
In this case, you can unroll the group you have as
String rx = "^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword";
See IDEONE demo
Here, (\w+(\s|/|&|-)*)* is unrolled as (\w+(?:[\s/&-]+\w+)*) (I kept the outer parentheses to produce capture group #1; you may remove them if you do not need the group). \w+ matches one or more word characters (so it is an obligatory subpattern), and the (?:[\s/&-]+\w+)* subpattern matches zero or more (*, so this whole group is optional) sequences of one or more characters from the character class [\s/&-]+ (obligatory) followed by one or more word characters \w+.
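To see the unrolled pattern in action, here is a small sketch. It uses Python rather than Java purely for brevity (an assumption on my part; the pattern itself is the same apart from string escaping), and the helper name leading_words is hypothetical.

```python
import re

# Unrolled pattern from the answer; only the string escaping differs from Java.
UNROLLED = re.compile(r"^\s*(\w+(?:[\s/&-]+\w+)*)[\s/&-]+myword")


def leading_words(line):
    """Return capture group 1, or None when the line does not match (hypothetical helper)."""
    m = UNROLLED.match(line)
    return m.group(1) if m else None


print(leading_words("foo/bar - baz myword"))  # foo/bar - baz
print(leading_words("foo myword"))            # foo
# At least one word and one separator are required before "myword"
print(leading_words("myword"))                # None
# A long run of word characters with no separator is rejected without
# the nested-quantifier blow-up, because every subpattern is obligatory
print(leading_words("a" * 200 + "!"))         # None
```

Note that, unlike the original pattern, the unrolled one requires at least one word and one separator before myword, which is what the question asks for.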
I am trying to write a regex that matches any number of words, but ignores a word if it is followed by a :. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only drops the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once [a-zA-Z]+ reaches a :, the lookahead fails, so the engine steps back one letter and re-checks the lookahead. It then finds a match whenever there are at least two letters before the colon, returning the letters that are not immediately followed by :. See your regex demo: c in c:real is not matched because there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
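The difference is easy to reproduce. The following sketch assumes Python's re module (the question uses regex101 and names no language) and applies both patterns to the sample string from the question.

```python
import re

TEXT = "lame:joker"

# Original pattern: backtracking lets the lookahead succeed one letter early.
original = re.findall(r"[a-zA-Z]+(?!:)", TEXT)

# Fixed pattern: the lookahead also rejects a following letter,
# so a shortened match such as "lam" is no longer acceptable.
fixed = re.findall(r"[A-Za-z]+(?![A-Za-z:])", TEXT)

print(original)  # ['lam', 'joker']
print(fixed)     # ['joker']
```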
Preventing backtracking into a word-like pattern by using a word boundary
As #scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", and it also fails on "12c" and "12_"
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_"; only "12:" is rejected.
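The four cases above can be checked with a short sketch, again assuming Python's re module; the helper name first_match is made up for illustration.

```python
import re

WITH_BOUNDARY = re.compile(r"\d+\b(?!:)")
WITH_LOOKAHEAD = re.compile(r"\d+(?![\d:])")


def first_match(pattern, text):
    """Return the first matched substring, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group(0) if m else None


for sample in ["12,", "12:", "12c", "12_"]:
    # Columns: sample, word-boundary result, lookahead-only result
    print(sample, first_match(WITH_BOUNDARY, sample), first_match(WITH_LOOKAHEAD, sample))
```

The word-boundary version only accepts "12," because c and _ are word characters, so there is no boundary after the digits; the lookahead-only version accepts everything except "12:".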
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.
I have the following strings:
'TwoOrMoreDimensions'
'LookLikeVectors'
'RecentVersions'
'= getColSums'
'=getColSums'
I would like to capture all occurrences of an uppercase letter that is preceded by a lowercase letter in all strings but the last two.
I can use ([a-z]+)([A-Z]) to capture all such occurrences but I don't know how to exclude matches from the last two strings.
The last two strings can be excluded using the negative lookahead ^(?!>\s|\=) - is it possible to combine this with the expression above?
I tried ^(?!>\s|\=)(([a-z]+)([A-Z])) but it doesn't yield any matches. I'm not sure why because ^(?!>\s|\=)(.+) captures all characters after the start of the matching string as a group. So why can't this capture group be further divided into group 2 ([a-z]+) and group 3 ([A-Z])?
Link to tester
The issue with your current regex is that the ^ anchors it to the start of string, so it can only match a sequence of lower case letters followed by an upper case letter at the start of the string, and none of your strings have that.
One way to do what you want is to use the \G anchor, which forces the current match to start where the previous one ended. That can be used in an alternation with ^(?!=) which will match any string which doesn't start with an = sign, and then a negated character class ([^a-z]) to skip any non-lower case characters:
(?:^(?!=)|\G)[^a-z]*(([a-z]+)([A-Z]))
This will give the same capture groups as your original regex.
Another solution (it may not be the most efficient, but it does the job) would be (?:^=\s*\w*)|([a-z]+)([A-Z])
If the string begins with =, the first alternative greedily consumes the = and the word after it in a non-capturing group (this still counts as the full match), leaving nothing for the capture groups in the second alternative.
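Here is a sketch of the alternation approach in Python. Python's built-in re module is an assumption and does not support the \G anchor, so only this variant is shown; the helper name lower_upper_pairs is hypothetical. Matches produced by the first alternative have empty capture groups and are filtered out.

```python
import re

PATTERN = re.compile(r"(?:^=\s*\w*)|([a-z]+)([A-Z])")


def lower_upper_pairs(text):
    """Return (lowercase run, uppercase letter) pairs, skipping the '=' alternative."""
    return [
        (m.group(1), m.group(2))
        for m in PATTERN.finditer(text)
        if m.group(1) is not None  # None means the '=' alternative matched
    ]


print(lower_upper_pairs("TwoOrMoreDimensions"))  # [('wo', 'O'), ('r', 'M'), ('ore', 'D')]
print(lower_upper_pairs("RecentVersions"))       # [('ecent', 'V')]
print(lower_upper_pairs("= getColSums"))         # []
print(lower_upper_pairs("=getColSums"))          # []
```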
I am looking for regex to match following set:
/VIDEO_PRE_MINE
/VIDEO_PRE
/VIDEO_PRE/
/VIDEO_PRE/SOMETHING
And I want to exclude expressions like this:
/VIDEO_PRESOMETHING
/VIDEO_PREsomething/something
In other words, '_PRE' must not be directly followed by a letter, but it may be followed by the end of the string.
Here are the regexes that I tried:
1. ^\/[^\/]*_PRE[^a-z|A-Z]
2. ^\/[^\/]*_PRE[^a-z|A-Z]?$
However I didn't manage to cover all use cases from sets with those regex.
I would really appreciate any help with this.
Thanks
For your example data, you could add an optional group (?:[_/].*)? that matches either _ or / followed by any characters except a newline (zero or more times), up to the end of the string $
^/[^/]*_PRE(?:[_/].*)?$
^ Start of string
/[^/]* Match /, then 0+ times any char except /
_PRE Match literally
(?: Non capturing group
[_/].* Match either _ or / followed by 0+ times any char except a newline
)? Close non capturing group and make it optional
$ End of string
Note that the forward slashes are not escaped. Depending on the language or delimiters you might have to escape them.
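As a sanity check, this sketch runs the pattern against the six paths from the question. Python's re module is an assumption here; with a raw string the forward slashes need no escaping.

```python
import re

PATTERN = re.compile(r"^/[^/]*_PRE(?:[_/].*)?$")

should_match = [
    "/VIDEO_PRE_MINE",
    "/VIDEO_PRE",
    "/VIDEO_PRE/",
    "/VIDEO_PRE/SOMETHING",
]
should_not_match = [
    "/VIDEO_PRESOMETHING",
    "/VIDEO_PREsomething/something",
]

for path in should_match + should_not_match:
    # Prints True for the first four paths and False for the last two
    print(path, PATTERN.search(path) is not None)
```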
My guess is that we might want to have some right boundaries, such as
^\/VIDEO_PRE(?:\b\/?|\/[^\/\s]+\/?|_[^\/\s]+\/?)$
in specified form, and in general form:
^\/[^_]+_PRE(?:\b\/?|\/[^\/\s]+\/?|_[^\/\s]+\/?)$
which might work. You would likely want to test and modify the expression, which is explained on the top right panel of regex101.com, if you wish to explore/simplify it, and in this link, you can watch how it would match against some sample inputs, if you like.
I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!
In consecutive matching, the search for the next match starts where the previous match ended. Since the non-whitespace character after the + is part of the match too, the search for the next match starts after that character. So in a sequence like s+e+S you will find only one match:
s+e+S
\_/
You can fix that by using look-around assertions, which check the neighboring characters without consuming them, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all, as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
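The effect of consuming the trailing character can be reproduced with the query string from the question. This is a minimal sketch assuming Python's re module.

```python
import re

QUERY = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# Both neighbors are consumed, so the "e" in s+e+S cannot start a second match.
consuming = re.findall(r"\S\++\S", QUERY)

# The trailing neighbor is only asserted, not consumed.
lookahead = re.findall(r"\S\++(?=\S)", QUERY)

# No neighbors at all: every run of pluses is found.
plain = re.findall(r"\++", QUERY)

print(consuming)  # ['s+n', 'e+c', 's+e']
print(lookahead)  # ['s+', 'e+', 's+', 'e+']
print(plain)      # ['+', '+', '+', '+']
```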
You are correct: the fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround, if your regex implementation supports it (JavaScript engines older than ES2018 don't support lookbehind, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are surrounded by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.
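The negative form (?<!\s)\++(?!\s) and the positive form (?<=\S)\++(?=\S) behave the same in the middle of a string but differ at its edges. The sketch below assumes Python's re module and uses a made-up sample string (not a real query string) that contains whitespace and a leading plus to show the difference.

```python
import re

# Hypothetical sample: plus at index 0, plus before a space at 4, plus between letters at 7
TEXT = "+a b+ c+d"

# "Not whitespace" also holds at the start of the string, where nothing precedes the plus.
negative = [m.start() for m in re.finditer(r"(?<!\s)\++(?!\s)", TEXT)]

# "Is non-whitespace" requires an actual character on both sides.
positive = [m.start() for m in re.finditer(r"(?<=\S)\++(?=\S)", TEXT)]

print(negative)  # [0, 7]
print(positive)  # [7]
```

Both forms reject the plus at index 4 because it is followed by a space; only the negative form accepts the plus at the very start of the string.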