Regex ignore part of the string in matches - regex

Suppose I have a tags object as such:
["warn-error-fatal-failure-exception-ok","parsefailure","anothertag","syslog-warn-error-fatal-failure-exception-ok"]
I would like to be able to use regex to match on "failure" but exclude "warn-error-fatal-failure-exception-ok".
So in the above case if I used my regex to search for failure it should only match failure on parsefailure and ignore the rest.
How can this be accomplished using regex?
NOTE: The regex has to exclude the whole string "warn-error-fatal-failure-exception-ok"

EDIT
After documenting the answer below, I realized that maybe what you are looking for is:
(?<!warn-error-fatal-)failure(?!-exception-ok)
So I'm adding it here in case it fits what you are looking for better. This regex just looks for "failure", but uses a Negative Lookbehind and a Negative Lookahead to specify that "failure" may not be preceded by "warn-error-fatal-" and may not be followed by "-exception-ok".
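A quick way to check the lookaround pattern is to run it over the tags from the question. This is a minimal sketch that assumes Python's re module (the question does not name a regex engine); the variable names are illustrative only.

```python
import re

# Tags copied from the question.
tags = [
    "warn-error-fatal-failure-exception-ok",
    "parsefailure",
    "anothertag",
    "syslog-warn-error-fatal-failure-exception-ok",
]

# "failure" that is neither preceded by "warn-error-fatal-"
# nor followed by "-exception-ok".
pattern = re.compile(r"(?<!warn-error-fatal-)failure(?!-exception-ok)")

# Keep only the tags in which the pattern finds a match.
matching_tags = [tag for tag in tags if pattern.search(tag)]
print(matching_tags)
```

Only parsefailure survives the filter, because in the two long tags "failure" is both preceded by "warn-error-fatal-" and followed by "-exception-ok".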
ANSWER DEVELOPED FROM COMMENTS:
The following regex captures the "failure" substring in the "parsefailure" tag, and it puts it in Group 1:
^.*"(?![^"]*warn-error-fatal-failure-exception-ok[^"]*)[^"]*(failure)[^"]*".*$
DETAIL
I will break the regex into parts and explain each one. First, let's forget about everything in between the first set of parentheses, and let's just look at the rest.
^.*"[^"]*(failure)[^"]*".*$
The important part of the regex is what we are trying to capture in the group: the word "failure", which is part of a tag surrounded by double-quotes. The regular expression above matches the whole test string, but it focuses on a tag surrounded by double-quotes and containing the substring "failure".
^.*" matches any character from the beginning of the string to a quote
"[^"]*(failure)[^"]*" matches a tag surrounded by double-quotes and containing the substring "failure". Literally: a double-quote, followed by zero or more characters that are not double-quotes, followed by "failure", followed by zero or more characters that are not double-quotes, followed by a double-quote. The parentheses capture the word "failure" in group 1.
".*$ matches any character from the double-quote to the end of the test string
Because [^"]*(failure)[^"]* matches any tag containing the substring "failure", ^.*"[^"]*(failure)[^"]*".*$ will capture "failure" from whichever such tag the engine reaches first. Since the leading .* is greedy, that is the last such tag in the string, here syslog-warn-error-fatal-failure-exception-ok, which is not what we want. So we must exclude every tag containing warn-error-fatal-failure-exception-ok from being a possible match for the tag portion of the regex: [^"]*(failure)[^"]*. This is achieved with a Negative Lookahead:
(?![^"]*warn-error-fatal-failure-exception-ok[^"]*)
This Negative Lookahead means: "The text starting at this position must not match [^"]*warn-error-fatal-failure-exception-ok[^"]*". In other words, the tag that begins right after the double-quote must not contain warn-error-fatal-failure-exception-ok. The (?! and ) are just part of the syntax.
MORE BREAKDOWN
^ matches the beginning of the test string
.* matches any character zero or more times
" matches a double-quote character
[^"]* matches any character other than the double-quote character zero or more times
(failure) matches the word "failure", and since it is in parentheses, it will capture it in a group; in this case, it will be captured in group 1 because there is only one set of capturing parentheses. The parentheses of the Negative Lookahead are non-capturing.
$ matches the end of the test string
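To see what the Negative Lookahead changes, the two versions of the regex can be compared on the test string. This is a sketch assuming Python's re module; enclosing_tag is a hypothetical helper written for this illustration, not part of the answer.

```python
import re

# The tags object from the question, as one string.
text = ('["warn-error-fatal-failure-exception-ok","parsefailure",'
        '"anothertag","syslog-warn-error-fatal-failure-exception-ok"]')

without_lookahead = re.compile(r'^.*"[^"]*(failure)[^"]*".*$')
with_lookahead = re.compile(
    r'^.*"(?![^"]*warn-error-fatal-failure-exception-ok[^"]*)'
    r'[^"]*(failure)[^"]*".*$'
)

def enclosing_tag(source, match):
    """Hypothetical helper: return the quoted tag that contains group 1."""
    start = source.rfind('"', 0, match.start(1)) + 1
    end = source.find('"', match.end(1))
    return source[start:end]

tag_without = enclosing_tag(text, without_lookahead.match(text))
tag_with = enclosing_tag(text, with_lookahead.match(text))
print(tag_without)
print(tag_with)
```

Without the lookahead, group 1 lands inside the syslog-warn-error-fatal-failure-exception-ok tag; with it, group 1 lands inside parsefailure.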

Regular expression: [A-Za-z-]*(?<!("warn-error-fatal-))failure
This matches parsefailure and the failure inside "syslog-warn-error-fatal-failure-exception-ok", but not the failure in the first tag.
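The behaviour of this alternative can be checked with a short sketch, again assuming Python's re module. Because the lookbehind includes the opening double-quote, it only rejects "failure" when the tag itself starts with warn-error-fatal-.

```python
import re

# The tags object from the question, as one string.
text = ('["warn-error-fatal-failure-exception-ok","parsefailure",'
        '"anothertag","syslog-warn-error-fatal-failure-exception-ok"]')

# The lookbehind rejects "failure" only when it is preceded by
# a double-quote followed by "warn-error-fatal-".
pattern = re.compile(r'[A-Za-z-]*(?<!("warn-error-fatal-))failure')

# Collect the full text of every match, not just the lookbehind group.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```

The syslog tag is still matched up to and including "failure", so this pattern does not satisfy the NOTE in the question about excluding the whole string.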

Related

Regular Expression to prevent Email Name Spoofing

I want to match everything where .com or my\s?example appears in the display name of a From header and where the From email address is not .*#myexample.com.
It's easy when the display name is enclosed by quotation marks, but fails when the quotation marks are absent.
"(.*?(my\s?example|\.com).*?)"(?!\s?\<.*?\#myexample\.com\>)
Please see here:
https://regexr.com/5im6l
Everything works as desired except for the last line in the input field, where the double quotes are missing. I would like it to also match for this.
If an if clause is supported, and you want to capture what is between the double quotes if they are both there or capture the whole string if there are no double quotes at the start and end, you might use:
\bFrom:\s(")?(.*?\b(my\s?example|\.com)\b.*?)(?(1)")\s+<(?!\s?[^\r\n<>]*#myexample\.com>)
The pattern matches:
\bFrom:\s(")? A word boundary, match From: and optionally capture " in group 1
(.*?\b(my\s?example|\.com)\b.*?) Capture group 2, match a part that contains either myexample or .com where the alternatives are in group 3
(?(1)") If clause, if group 1 exists, match " so it is not part of the capture group
\s+< Match 1+ whitespace chars and <
(?! Negative lookahead, assert that what is at the right is not
\s?[^\r\n<>]*#myexample\.com> Match #myexample\.com between the brackets
) Close lookahead
Group 2 contains the display name (without the surrounding double quotes, if present), and group 3 contains the part that is either myexample or .com. Use a case insensitive match so that Myexample is matched as well.
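The conditional pattern can be tried out as follows. This is a sketch assuming Python's re module, which supports the (?(1)...) if clause; the sample headers and the spoofed_display_name helper are hypothetical and only serve the illustration.

```python
import re

pattern = re.compile(
    r'\bFrom:\s(")?(.*?\b(my\s?example|\.com)\b.*?)(?(1)")\s+<'
    r'(?!\s?[^\r\n<>]*#myexample\.com>)',
    re.IGNORECASE,
)

def spoofed_display_name(header):
    """Return the suspicious display name (group 2), or None if no match."""
    match = pattern.search(header)
    return match.group(2) if match else None

# Hypothetical sample headers; "#" stands in for the address separator,
# as in the question.
quoted_spoof = 'From: "My Example Support" <bad#evil.test>'
unquoted_spoof = 'From: My Example Support <bad#evil.test>'
legitimate = 'From: "My Example Support" <help#myexample.com>'

print(spoofed_display_name(quoted_spoof))
print(spoofed_display_name(unquoted_spoof))
print(spoofed_display_name(legitimate))
```

Both spoofed headers yield the display name with or without quotes, while the header that uses the myexample.com address is not matched.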
If \K is supported to forget what has been matched so far, and you want the display name as the match itself instead of in a capture group, another option is:
\bFrom:\s"?\K.*?\b(?:my\s?example|\.com)\b.*?(?="?\s<(?![^<>]*#myexample\.com>))
Note that you don't have to escape \< \> and \#

Regex: exclude string from matched pattern

Input string:
hrStorageDescr{hrStorageDescr="devfs: dev file system, mounted on: /.mount/dev"}
Regex to match value of hrStorageDescr only:
.*hrStorageDescr="(.*?)",.*
How to write this regex in order to preserve matching function, but exclude everything in the value, if devfs string is matched?
You could match hrStorageDescr preceded by a word boundary \b
Then match =" and assert that what is directly to the right is not devfs followed by a word boundary, using a negative lookahead (?!devfs\b)
If that assertion succeeds, capture the value in a group by matching any char except a " using a negated character class, and close the group before matching the closing double quote ([^"]+)
Using .* will match the last occurrence of the pattern, using .*? will match the first. If you want all occurrences you could omit that part, assuming you are allowed to collect all matches instead of a single match.
.*?\bhrStorageDescr="(?!devfs\b)([^"]+)"
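A short check of this pattern, sketched with Python's re module. The second input line and the storage_description helper are hypothetical, added only to show a value that is not excluded.

```python
import re

pattern = re.compile(r'.*?\bhrStorageDescr="(?!devfs\b)([^"]+)"')

def storage_description(line):
    """Return the captured value, or None when the value starts with devfs."""
    match = pattern.search(line)
    return match.group(1) if match else None

# Input string from the question.
devfs_line = ('hrStorageDescr{hrStorageDescr="devfs: dev file system, '
              'mounted on: /.mount/dev"}')
# Hypothetical second line whose value does not start with devfs.
other_line = 'hrStorageDescr{hrStorageDescr="/var: local disk"}'

print(storage_description(devfs_line))
print(storage_description(other_line))
```

The devfs line produces no match at all, while the other line yields its value in group 1.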

REGEX: Select all text between last underscore and dot

I'm having trouble retrieving specific information of a string.
The string is as follows:
20190502_PO_TEST.pdf
This includes the .pdf part. I need to retrieve the part between the last underscore (_) and the dot (.) leaving me with TEST
I've tried this:
[^_]+$
This however, returns:
TEST.pdf
I've also tried this:
_(.+)\.
This returns:
PO_TEST
The pattern [^_]+$ matches one or more characters other than an underscore up to the end of the string, so it also matches the dot and the extension.
In the pattern _(.+). you have to escape the dot to match it literally, as in _(.+)\. The captured text is then in the first capturing group, but it starts after the first underscore rather than the last one.
What you also might use:
^.*_\K[^.]+
^.*_ Match from the start of the string up to and including the last underscore
\K Forget what was matched
[^.]+ Match 1+ times any char except a dot
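The three patterns can be compared side by side. This sketch assumes Python's re module, which does not support \K, so the last pattern is swapped for an equivalent with a capturing group, ^.*_([^.]+); that substitution is mine and not part of the answer.

```python
import re

filename = "20190502_PO_TEST.pdf"

# First attempt from the question: runs to the end, so it keeps the extension.
with_extension = re.search(r"[^_]+$", filename).group(0)

# Second attempt: the underscore matches the first "_" in the string.
from_first_underscore = re.search(r"_(.+)\.", filename).group(1)

# Capture-group stand-in for ^.*_\K[^.]+ : the greedy .* runs to the
# last underscore, then group 1 takes everything up to the dot.
last_segment = re.search(r"^.*_([^.]+)", filename).group(1)

print(with_extension, from_first_underscore, last_segment)
```

In an engine that supports \K (PCRE, for example), the whole match of ^.*_\K[^.]+ plays the role that group 1 plays here.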

Why is this regex selecting this text

I am using the regex
(.*)\d.txt
on the expression
MyFile23.txt
Now the online tester says that, using the above regex, the mentioned string would be allowed (selected). My understanding is that it should not be allowed, because there are two numeric digits, 2 and 3, while the regex has only one numeric digit in it, i.e. \d; it would have to be \d+ to allow both. As I read it, my current expression says: zero or more of any character, followed by one numeric digit, followed by .txt. My question is: why does the above string pass the regex?
This regex (.*)\d.txt will still match MyFile23.txt because of .* which will match 0 or more of any character (including a digit).
So for the given input: MyFile23.txt here is the breakup:
.* # matches MyFile2
\d # matches 3
. # matches a dot (though it can match anything here due to unescaped dot)
txt # will match literal txt
To make sure it only matches MyFile2.txt you can use:
^\D*\d\.txt$
Where ^ and $ are anchors to match start and end. \D* will match 0 or more non-digit.
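The breakup above and the anchored fix can be verified with a small sketch, assuming Python's re module. The string "MyFile2xtxt" is a hypothetical input added to show the effect of the unescaped dot.

```python
import re

loose = re.compile(r"(.*)\d.txt")       # pattern from the question
strict = re.compile(r"^\D*\d\.txt$")    # anchored fix

# The greedy (.*) swallows the first digit and leaves one for \d.
loose_group = loose.search("MyFile23.txt").group(1)

# The unescaped dot accepts any character in place of the period.
loose_accepts_any_char = loose.search("MyFile2xtxt") is not None

# The strict pattern allows exactly one digit before ".txt".
strict_two_digits = strict.search("MyFile23.txt")
strict_one_digit = strict.search("MyFile2.txt")

print(loose_group, loose_accepts_any_char)
print(strict_two_digits, strict_one_digit is not None)
```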
The pattern you have has one group, (.*), which with your example matches MyFile2, because the . allows any character, digits included.
Furthermore the . in the pattern after this group is not escaped which will result in allowing another character of any kind.
To avoid this use:
(\D*)\d+\.txt
the group (\D*) would now match all non digit characters.
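A brief sketch of this variant, assuming Python's re module. Note that \d+ accepts one or more digits, so unlike the single-digit pattern it still matches MyFile23.txt; the "MyFile23xtxt" input is hypothetical.

```python
import re

pattern = re.compile(r"(\D*)\d+\.txt")

# Group 1 now stops at the first digit instead of swallowing it.
name_part = pattern.search("MyFile23.txt").group(1)

# With the dot escaped, an arbitrary character is no longer accepted.
rejects_other_char = pattern.search("MyFile23xtxt") is None

print(name_part, rejects_other_char)
```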
Here is why "MyFile23.txt" matches the regex pattern:
A literal period . should always be escaped as \. or else it will match any character.
And finally, (.*) matches everything from the beginning of the string up to the last digit (MyFile2).
So, I'd suggest the following fix:
^\D*\d\.txt$ = beginning of a line/string, any number of non-digit characters, a single digit, a literal period, a literal txt, and the end of the string/line (depending on the m switch, which in turn depends on the input: whether you have a list of file names on separate lines, or just a single file name).

Match everything to the first unescaped (with \) character

I have following input:
!foo\[bar[bB]uz\[xx/
I want to match everything from the start up to the first unescaped [, including any escaped bracket \[, and omitting leading characters that are in the [!#\s] group.
Expected output:
foo\[bar
I've tried with:
(?![!#\s])[^/\s]+\[
But it returns:
foo\[bar[bB]uz\[
Java: Use Lookbehind
(?<=!)(?:\\\[|[a-z])+
Explanation
The lookbehind (?<=!) asserts that what precedes the current position is the character !
The non-capture group (?:\\\[|[a-z]) matches \[ OR | a letter between a and z
The + causes the group to be matched one or more times
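The lookbehind pattern explained above also works unchanged in other engines. The sketch below uses Python's re module instead of Java (my choice for illustration) and runs it on the input from the question.

```python
import re

# Input from the question; a raw string keeps the backslashes literal.
text = r"!foo\[bar[bB]uz\[xx/"

# After a "!", take escaped brackets or lowercase letters, one or more times.
pattern = re.compile(r"(?<=!)(?:\\\[|[a-z])+")

result = pattern.search(text).group(0)
print(result)
```

The match stops at the first unescaped [, because a bare [ is neither \[ nor a letter between a and z.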
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
You can use this regex:
!((?:[^[\\]*\\\[)*[^[]*)
Add a ? after [^/\s]+ to catch the shortest group possible
Add \w+ to the end to catch the first group of alphanumeric characters after \[
Result :
(?![!#\s])[^\/\s]+?\[\w+
You can try this pattern:
(?<=^[!#\s]{0,1000})(?:[^!#\s\\\[]|\\.)(?>[^\[\\]+|\\.)*(?=\[)
pattern details:
The beginning is a lookbehind and means preceded by zero or several forbidden characters at the start of the string
(?:[^!#\s\\\[]|\\.) ensures that the first character is an allowed character or an escaped character.
(?>[^\[\\]+|\\.)* describes the content: all that is not a [ or a \, or an escaped character. (note that this subpattern can be written like that too: (?:[^\[\\]|\\.)*)
(?=\[) checks that the next character is a literal opening square bracket. (Since all escaped characters are matched by the preceding group, you can be sure that this one is not escaped.)
Use a negated character class at the start (i.e. the match must not start with a special char), then a reluctant quantifier (which stops at the first hit), with a negative look behind to skip over escaped brackets:
[^!#\s].*?(?<!\\)\[
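One detail worth checking is that this pattern consumes the unescaped bracket itself. The sketch below, assuming Python's re module, shows that, along with a variant of my own (not from the answer) that wraps the final part in a lookahead so the bracket stays out of the match.

```python
import re

# Input from the question; a raw string keeps the backslashes literal.
text = r"!foo\[bar[bB]uz\[xx/"

# Pattern as given: the match ends with the first unescaped "[".
as_given = re.search(r"[^!#\s].*?(?<!\\)\[", text).group(0)

# Assumed variant: assert the unescaped "[" with a lookahead instead of
# consuming it, so the match ends just before the bracket.
without_bracket = re.search(r"[^!#\s].*?(?=(?<!\\)\[)", text).group(0)

print(as_given)
print(without_bracket)
```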