How to match a string with an optional part? - regex

We have a string that we need to parse using regex, the string could be either:
There was a problem at XXXX
There was a problem at XXXX, previous failures were YYY
The XXX could be any character (e.g. ".")
How can we make regex that will match:
XXXX
", previous failures were YYY" (remember could be optional)
Every regex that I tried either captures everything in the first group (because it is greedy) or captures too little (because it is not greedy).
I know this is advanced, but maybe someone has already done it.

^There was a problem at (.*?)(?:, previous failures were (.*))?$
(.*?) means match anything, but as little as possible while still letting the overall pattern match. The ^ and $ anchors force the regex to span the entire line, so the lazy group cannot stop early and capture nothing.
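A minimal sketch of how the two groups come back in Python, assuming the pattern above is used unchanged. The helper name parse_problem is hypothetical, not something from the question:

```python
import re

# Pattern from the answer above: lazy first group, optional second group.
PATTERN = re.compile(
    r"^There was a problem at (.*?)(?:, previous failures were (.*))?$"
)

def parse_problem(line):
    """Return (location, previous_failures); previous_failures may be None."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

short = parse_problem("There was a problem at XXXX")
full = parse_problem("There was a problem at XXXX, previous failures were YYY")
print(short)  # ('XXXX', None)
print(full)   # ('XXXX', 'YYY')
```

When the optional part is absent, the second group is simply None, so the caller has to check for that.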
EDIT: If you really want the surrounding error text, and not just "XXX" and "YYY", then use the following regex instead:
^There was a problem at (.*?)(, previous failures were .*)?$
EDIT 2: Depending on the format of XXX, you may be able to get away with the following, but only if there are no commas in "XXX". Apart from that case, you need at least the $ anchor to make sure the non-greedy match actually captures something. As you noted in your question, using a greedy match isn't an option at all (at least not while using .).
There was a problem at ([^,]*)(, previous failures were .*)?
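To illustrate why the $ anchor matters for the lazy version, here is a small Python comparison. It is only a sketch using the sample line from the question:

```python
import re

line = "There was a problem at XXXX, previous failures were YYY"

# Lazy group with no $ anchor: nothing forces it to grow, so it stays empty
# and the optional group is skipped.
unanchored_lazy = re.match(
    r"There was a problem at (.*?)(, previous failures were .*)?", line
)
print(unanchored_lazy.groups())  # ('', None)

# Negated character class: stops at the first comma, no anchors needed.
no_comma = re.match(
    r"There was a problem at ([^,]*)(, previous failures were .*)?", line
)
print(no_comma.groups())  # ('XXXX', ', previous failures were YYY')
```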

A Perl, Java, Python, .NET, JavaScript etc. compatible regex could be
^There was a problem at (.*?)(, previous failures were .*)?$
if I understand your question correctly. If you need a code sample, please provide more details.

Related

Can I exclude Positive Lookaheads and Lookbehinds within a snippet in vscode?

I am having issues excluding parts of a string in a VSCode Snippet. Essentially, what I want is a specific piece of a path but I am unable to get the regex to exclude what I need excluded.
I have recently asked a question about something similar which you can find here: Is there a way to trim a TM_FILENAME beyond using TM_FILENAME_BASE?
As you can see, I am mainly getting tripped up by how snippets work within VSCode, not so much by the regular expressions themselves.
${TM_FILEPATH/(?<=area)(.+)(?=state)/${1:/pascalcase}/}
Given a file path that looks like abc/123/area/my-folder/state/...
Expected:
/MyFolder/
Actual:
abc/123/areaMyFolderstate/...
You need to match the whole string to achieve that:
"${TM_FILEPATH/.*area(\\/.*?\\/)state.*/${1:/pascalcase}/}"
See the regex demo
Details
.* - any 0+ chars other than line break chars, as many as possible
area - a word
(\\/.*?\\/) - Group 1: /, any 0+ chars other than line break chars, as few as possible, and a /
state.* - state substring and the rest of the line.
NOTE: If there must be no other subparts between area and state, replace .*? with [^\\/]* or even [^\\/]+.
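The difference between .*? and [^\\/]+ in the NOTE can be checked outside VSCode. Below is a rough Python sketch (Python needs no escaping for /). The path with an extra "sub" folder is a made-up example, not from the question:

```python
import re

# Hypothetical path with an extra folder between "area" and "state".
path = "abc/123/area/my-folder/sub/state/x"

# Lazy dot: still matches, but the group spans both folders.
lazy = re.fullmatch(r".*area(/.*?/)state.*", path)
print(lazy.group(1))  # /my-folder/sub/

# Negated class: exactly one path segment is allowed, so there is no match.
strict = re.fullmatch(r".*area(/[^/]+/)state.*", path)
print(strict)  # None
```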
The expected output differs from the corresponding part of the input string (my-folder versus MyFolder). If the regex itself had to do that conversion, the expression could get pretty complicated, such as:
(?:[\s\S].*?)(?<=area\/)([^-])([^-]*)(-)([^\/])([^\/]*).*
and a replacement of something similar to /\U$1\E$2$3\U$4\E$5/, if available.
If the pascalcase transform already handles that conversion (I'm guessing it does), this simpler expression might work here:
.*area(\\/.*?\\/).*
and the desired data is in this capturing group $1:
(\\/.*?\\/)
Building on my answer you linked to in your question, remember that lookarounds are "zero-length assertions" and "do not consume characters in the string". See lookarounds are zero-length assertions:
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
So in your snippet transform: /(?<=area)(.+)(?=state)/ the lookaround portions are not actually consumed and so are simply passed through. Vscode treats them, as it should, as not actually being within the "part to be transformed" segment at all.
That is why lookarounds are not excluded from your transform.
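The same effect can be reproduced with an ordinary regex substitution. This is only an analogy in Python, not VSCode's snippet engine: upper() stands in for the pascalcase transform, and the trailing "x" stands in for the elided rest of the path.

```python
import re

path = "abc/123/area/my-folder/state/x"  # "x" stands in for the elided tail

# Lookarounds only assert; the match itself is just "/my-folder/",
# so everything around it is left in place by the substitution.
with_lookarounds = re.sub(
    r"(?<=area)(.+)(?=state)", lambda m: m.group(1).upper(), path
)
print(with_lookarounds)  # abc/123/area/MY-FOLDER/state/x

# Consuming the whole string replaces all of it with the captured group.
whole_string = re.sub(r".*area(/.*?/)state.*", r"\1", path)
print(whole_string)  # /my-folder/
```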

Having difficulty understanding regex backtracking

I was browsing through the regex-tagged questions on SO when I came across this problem:
A regex for a url was needed, the url begins with domain.com/advertorials/
The regex should match the following scenarios,
domain.com/advertorials
domain.com/advertorials?test=true
domain.com/advertorials/
domain.com/advertorials/?test=true
but not this,
domain.com/advertorials/version1?test=true
I came up with this regex advertorials\/?(?:(?!version)(.*))
This should work, but it doesn't for the last case. Looking at the debugger in regex101.com,
I see that after matching 's/' it matches the word 'version' character by character and ultimately succeeds, but since this is a negative lookahead the condition fails. And this is the part I don't understand: after failing, it backtracks to before the '/' in 's/' and not to after 's/'.
Is this how it's supposed to work? Can anyone help me understand?
(here's the demo link: https://regex101.com/r/ww3HR8/1).
Thanks,
Note: People have already given their solutions to that problem; I just want to know why my regex fails.
The backtracking mechanism is in charge of this phenomenon, as you have already pointed out.
The ? quantifier, which matches 1 or 0 repetitions of the quantified subpattern, lets the regex engine match the string in two ways: either by matching the quantified subpattern, or by skipping it and going on with the subsequent subpattern.
So, advertorials/?(?!version)(.*) (I removed the redundant (?:...) non-capturing group), when applied to domain.com/advertorials/version1?test=true, matches advertorials, then matches /, and then the negative lookahead checks if, immediately to the right of the current position, there is a version substring. Since there is version after /, the regex engine goes back and sees that the /? pattern can match an empty string. So, the lookahead check is re-applied straight after advertorials. There is no version after advertorials, and the match is returned.
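The backtracking step described above is easy to observe from the capture group. A short Python check, using the URL from the question:

```python
import re

url = "domain.com/advertorials/version1?test=true"
m = re.search(r"advertorials/?(?!version)(.*)", url)

# The engine first lets /? consume "/", the lookahead fails, and it then
# retries with /? matching nothing; the "/" ends up inside group 1.
print(m.group(0))  # advertorials/version1?test=true
print(m.group(1))  # /version1?test=true
```

The leading slash in group 1 is the giveaway that /? matched the empty string.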
The usual solution is using possessive quantifiers or atomic groups, but there are other approaches, too.
E.g.
advertorials\/?+(?!version)(.*)
              ^^
See the regex demo. Here, \/?+ matches 1 or 0 / chars, but once it matches, the engine cannot go back and re-match a part of a string with this pattern.
Or, you may include /? in the lookahead and place the lookahead before the /? pattern:
advertorials(?!\/?version)\/?(.*)
See another regex demo.
If you plan to disallow version anywhere after advertorials use
advertorials(?!.*version)\/?(.*)
See yet another demo.
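As a quick sanity check, here is a Python sketch that runs the lookahead-first variant against the URLs from the question. It deliberately avoids the possessive ?+ form, which Python's re module only accepts from version 3.11 on:

```python
import re

# Variant with /? moved inside the lookahead, as suggested above.
fixed = re.compile(r"advertorials(?!/?version)/?(.*)")

should_match = [
    "domain.com/advertorials",
    "domain.com/advertorials?test=true",
    "domain.com/advertorials/",
    "domain.com/advertorials/?test=true",
]
rejected = "domain.com/advertorials/version1?test=true"

results = {u: bool(fixed.search(u)) for u in should_match + [rejected]}
for u, ok in results.items():
    print(ok, u)  # True for the first four URLs, False for the last
```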
Making the slash optional means there is a way to match without violating the constraint. If there is a way to match, the regex engine will find it, always.
Make the slash non-optional when it's followed by anything at all.
advertorials(?:/(?!version).*)?$
Incidentally, regex itself doesn't require the slash to be backslash-escaped (though some host languages use slashes as regex delimiters, so maybe you need to put it back). I also removed some redundant parentheses.
The reason:
The \/? part is optional:
advertorials\/?(?:(?!version)(.*))
Therefore it can also be advertorials(?:(?!version)(.*))
which matches advertorials/version
Essentially, (?!version)(.*) matches /version
Btw, this is normal backtracking by 1 character.
If you have already fixed it, then we're done !

Matching Everything That Does Not Contain Anything in an Array with Regex

I have the following regex query I'm trying to use to exclude assets from being cached:
^((?!(\.css|\.js|\.|\.json|\.xml|\.svg|\.ico|\.png|\.mp3|\.jpg|\.svg|\.woff|\.woff2|\.eot|\.ttf|\/api\/play\/add|\/api\/favorite|\/Listen\/channel|getAccountInfo)).)*$
Except it doesn't match https://exampl.com/home for some reason. Does anyone know how I can fix this? Also, is there any way I can make the regex better?
Your regex contains a |\.| part (after |\.js). That alternative makes your regex fail the match with any string containing a dot. You need to remove that alternative:
^((?!(\.css|\.js|\.json|\.xml|\.svg|\.ico|\.png|\.mp3|\.jpg|\.svg|\.woff|\.woff2|\.eot|\.ttf|\/api\/play\/add|\/api\/favorite|\/Listen\/channel|getAccountInfo)).)*$
See the regex demo
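The effect of the stray \. alternative can be reproduced with a shortened pattern. In this Python sketch the alternative list is abridged for readability (an assumption; the full list behaves the same way), and the asset URL is made up:

```python
import re

# Abridged alternative lists; "broken" keeps the bare \. alternative.
broken = re.compile(r"^((?!(\.css|\.js|\.|\.json|/api/favorite)).)*$")
fixed = re.compile(r"^((?!(\.css|\.js|\.json|/api/favorite)).)*$")

page = "https://exampl.com/home"
asset = "https://exampl.com/app.js"  # hypothetical asset URL

print(bool(broken.match(page)))  # False: the dot in "exampl.com" blocks it
print(bool(fixed.match(page)))   # True
print(bool(fixed.match(asset)))  # False: ".js" is still excluded
```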

Parse with Regex without trailing characters

How can I parse the text below so that I extract just
To: User <test#test.com>
and
To: <test#test.com>
When I try to parse the text below with
/To:.*<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>/mi
It grabs
Message-ID <CC2E81A5.6B9%test#test.com>,
which I don't want in my answer.
I have tried using $ and \z and neither work. What am I doing wrong?
Information to parse
To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>
To:
<test#test.com>
This is my parsing information in Rubular http://rubular.com/r/DQMQC4TQLV
Since you haven't specified exactly what your tool/language is, assumptions must be made.
In general, regex quantifiers are greedy and match the longest possible string. Your pattern starts off with .*, which means that you're going to match the longest possible string that ENDS WITH the remainder of your pattern <[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>, which was matched with <CC2E81A5.6B9%test#test.com> from the Message-ID.
Both Apalala's and nhahtdh's comments give you something to try. Avoid the all-inclusive .* at the start and use something that's a bit more specific: match leading spaces, or match anything EXCEPT the first part of what you're really interested in.
You need to make the wildcard match non-greedy by adding a question mark after it:
To:.*?<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>
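A rough Python comparison of the greedy and non-greedy forms on the first sample line. One assumption to flag: "%" is added to the first character class here so that the Message-ID value fits the address part of the pattern. Only the single-line case is shown.

```python
import re

line = "To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>"

# Assumption: "%" is added to the first character class so the Message-ID
# value fits the address part of the pattern.
address = r"<[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>"

greedy = re.search(r"To:.*" + address, line, re.I)
lazy = re.search(r"To:.*?" + address, line, re.I)

print(greedy.group(0))  # the whole line, through the Message-ID value
print(lazy.group(0))    # To: User <test#test.com>
```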

Regex lookahead

I am using a regex to find:
test:?
Followed by any character until it hits the next:
test:?
Now when I run this regex I made:
((?:test:\?)(.*)(?!test:\?))
On this text:
test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2
I expected to get:
test:?foo2=bar2&baz2=foo2
test:?foo=bar&baz=foo
test:?foo2=bar2&baz2=foo2
But instead it matches everything. Does anyone with more regex experience know where I have gone wrong? I've used regexes for pattern matching before but this is my first experience of lookarounds/aheads.
Thanks in advance for any help/tips/pointers :-)
I guess you could explore a greedy version.
(expanded)
(test:\? (?: (?!test:\?)[\s\S])* )
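For anyone who wants to try the expanded pattern directly, this Python sketch uses re.VERBOSE so the whitespace is ignored, as it is in the expanded form above:

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

chunk = re.compile(
    r"""
    ( test:\?                      # literal delimiter
      (?: (?!test:\?) [\s\S] )*    # any char that does not start a delimiter
    )
    """,
    re.VERBOSE,
)

chunks = chunk.findall(text)
for c in chunks:
    print(c)
# test:?foo2=bar2&baz2=foo2
# test:?foo=bar&baz=foo
# test:?foo2=bar2&baz2=foo2
```

The quantifier stays greedy; the negative lookahead in front of each character is what stops it at the next test:?.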
The Perl program below
#! /usr/bin/env perl
use strict;
use warnings;
$_ = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2";
while (/(test:\? .*?) (?= test:\? | $)/gx) {
    print "[$1]\n";
}
produces the desired output from your question, plus brackets for emphasis.
[test:?foo2=bar2&baz2=foo2]
[test:?foo=bar&baz=foo]
[test:?foo2=bar2&baz2=foo2]
Remember that regex quantifiers are greedy by default and want to gobble up as much as they can without breaking the match. You want each subsegment to terminate as soon as possible, which means .*? semantics.
Each subsegment terminates with either another test:? or end-of-string, which we look for with (?=...) zero-width lookahead wrapped around | for alternatives.
The pattern in the code above uses Perl’s /x regex switch for readability. Depending on the language and libraries you’re using, you may need to remove the extra whitespace.
Three issues:
(?!) is a negative lookahead assertion. You want (?=) instead, requiring that what comes next is test:?.
The .* is greedy; you want it non-greedy so that you grab just the first chunk.
You're wanting the last chunk also, so you want to match $ as well at the end.
End result:
(?:test:\?)(.*?)(?=test:\?|$)
I've also removed the outer group, seeing no point in it. All RE engines that I know of let you access group 0 as the full match, or some other such way (though perhaps not when finding all matches). You can put it back if you need to.
(This works in PCRE; not sure if it would work with POSIX regular expressions, as I'm not in the habit of working with them.)
If you're just wanting to split on test:?, though, regular expressions are the wrong tool. Split the strings using your language's inbuilt support for such things.
Python:
>>> import re
>>> re.findall(r'(?:test:\?)(.*?)(?=test:\?|$)',
... 'test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2')
['foo2=bar2&baz2=foo2', 'foo=bar&baz=foo', 'foo2=bar2&baz2=foo2']
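For the plain-split route mentioned above, a minimal Python sketch could look like this. Dropping the empty leading piece is a choice made here, not something the question specifies:

```python
text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# str.split yields an empty first element because the text starts with the
# delimiter; filter it out.
parts = [p for p in text.split("test:?") if p]
print(parts)  # ['foo2=bar2&baz2=foo2', 'foo=bar&baz=foo', 'foo2=bar2&baz2=foo2']
```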
You probably want ((?:test:\?)(.*?)(?=test:\?)), although you haven't told us what language you're using to drive the regexes.
The .*? matches as few characters as possible without preventing the whole string from matching, where .* matches as many as possible (is greedy).
Depending, again, on what language you're using to do this, you'll probably need to match, then chop the string, then match again, or call some language-specific match_all type function.
By the way, you don't need to anchor a regex using a lookahead (you can just match the pattern to search for, instead), so this will (most likely) do in your case:
test:[?](.*?)test:[?]
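One caveat worth checking if you collect all matches rather than just the first: a pattern that consumes the closing test:? leaves nothing for the next match to start from. A small Python comparison on the text from the question (the lookahead variant with |$ is borrowed from the other answers):

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# The closing delimiter is consumed, so the second chunk loses its opener
# and the third has no closing delimiter to match against.
consuming = re.findall(r"test:[?](.*?)test:[?]", text)
print(consuming)  # ['foo2=bar2&baz2=foo2']

# A lookahead leaves the delimiter in place for the next match.
lookahead = re.findall(r"test:[?](.*?)(?=test:[?]|$)", text)
print(lookahead)  # ['foo2=bar2&baz2=foo2', 'foo=bar&baz=foo', 'foo2=bar2&baz2=foo2']
```

So the consuming form is fine for grabbing the first chunk, but the lookahead form is the safer choice for a find-all.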