Regex match non-greedy on one optional string and greedy on another

I've been searching for a while and haven't found a way to match the following pattern (I'm also very new to regex). The input looks like either
/abc/foo/bar(/*)
or
/abc/foo/bar/stop
I want to match and capture /abc/foo/bar from either string. "/stop" is an optional string that may be appended to the end of the pattern. The goal is to get the desired capture while ignoring "stop" if it is present (and, if "stop" occurs multiple times, to stop at the first one), while allowing any number of slashes in the middle but excluding the slashes at the end of the line.
If I simply use:
^(/.*[^/])/*$
it is greedy and includes all slashes until I strip off any trailing ones. But to accept the second case, where there is an optional "/stop", I need to match non-greedily until I find the first "/stop" and stop there.
How can I craft a single regex that matches both cases?
EDIT: In case my previous example wasn't clear enough, here are more. I want to match and capture "/abc/foo/bar" in all of the following strings:
/abc/foo/bar
/abc/foo/bar/
/abc/foo/bar///
/abc/foo/bar/stop
/abc/foo/bar/stop/foo/bar/stop/stop
/abc/foo/bar//stop
The capture should not stop at "/abc/foo/bar" for either of the following:
/abc/foo/bar/sto (it should capture the whole "/abc/foo/bar/sto" instead)
/abc/foo/bar/abc/foo/bar (it should capture the whole "/abc/foo/bar/abc/foo/bar" instead)
Let me know if this is clear enough. Thanks!

Try this:
/^(?:\/+(?!$|(?:stop\/?))[^\/]+)*/
Explanation:
This matches the start of the string (^), followed by zero or more instances of the following pattern:
one or more slashes (\/+) that are not followed by the end of the string ($) or by stop, followed by
one or more non-slash characters ([^\/]+)
Here's a Debuggex Demo with working unit tests.
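For anyone who would rather check this in Python, here is a minimal sketch that runs the first regex against the strings from the question. The helper name capture_prefix and the extra "/stopper" case are my own additions for illustration, not part of the original answer.

```python
import re

# First regex from the answer. Slashes need no escaping in Python, and the
# inner non-capturing group around "stop/?" is dropped (it changes nothing).
PREFIX_RE = re.compile(r"^(?:/+(?!$|stop/?)[^/]+)*")

def capture_prefix(path):
    """Hypothetical helper: return the part of path the regex matches."""
    # The pattern can match the empty string, so match() never returns None.
    return PREFIX_RE.match(path).group(0)

samples = [
    "/abc/foo/bar",
    "/abc/foo/bar/",
    "/abc/foo/bar///",
    "/abc/foo/bar/stop",
    "/abc/foo/bar/stop/foo/bar/stop/stop",
    "/abc/foo/bar//stop",
    "/abc/foo/bar/sto",
    "/abc/foo/bar/abc/foo/bar",
]

for s in samples:
    print(f"{s} -> {capture_prefix(s)}")

# Edge case (my own test string): the lookahead rejects any segment that
# merely begins with "stop", so "/stopper" is cut off as well.
print(capture_prefix("/abc/foo/bar/stopper"))
```

The last line is worth a look if your paths can contain segments such as "stopper": this pattern treats them like "stop", whereas the \b in the alternative regex does not.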
EDIT: Here is an alternative, arguably simpler, regex:
/^.+?(?=\/*$|\/+stop\b)/
This matches one or more characters in a non-greedy manner, then stops when whatever is after the match is one of the following:
the end of the string ($), possibly preceded by one or more slashes (\/*)
one or more slashes, the word stop, and a word break.
Here's a Regex101 demo of this option.
EDIT 2: If you'd like to test this, here's a simple JavaScript test that tests the second regex above against various test strings and logs the results to the console:
var re = /^.+?(?=\/*$|\/+stop\b)/,
    test_strings = ["/abc/foo/bar",
                    "/abc/foo/bar/",
                    "/abc/foo/bar///",
                    "/abc/foo/bar/stop",
                    "/abc/foo/bar/stop/foo/bar/stop/stop",
                    "/abc/foo/bar//stop",
                    "/abc/foo/bar/sto",
                    "/abc/foo/bar/abc/foo/bar"];
// Log the matched prefix for each test string.
for (var s = 0; s < test_strings.length; s++) {
    console.log(test_strings[s].match(re)[0]);
}
/*
Results:
/abc/foo/bar
/abc/foo/bar
/abc/foo/bar
/abc/foo/bar
/abc/foo/bar
/abc/foo/bar
/abc/foo/bar/sto
/abc/foo/bar/abc/foo/bar
*/

You can try something like this:
^((?:/[^/]+)+?)(?:/*|/+stop(?:/.*)?)$
If atomic groups are available, it is better to write:
^((?:/[^/]+)+?)(?>/*$|/+stop(?:/.*)?)
If lookaheads are available:
^/(?>[^/]+|/(?!/*(?:$|stop(?:/|$))))+
P.S. Don't forget to escape the slashes if your regex delimiters are slashes.
As Ed Cottrell notes, features like atomic grouping are not available in languages like JavaScript or in Python's re module before version 3.11. However, this feature can be emulated efficiently using the fact that a lookahead is naturally atomic: (?>a+) <=> (?=(a+))\1
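The emulation trick can be sketched in Python as follows. The "aaab" test string and the helper name capture_prefix are assumptions made for illustration, and the last pattern is my own rewrite of the lookahead-based regex with its atomic group replaced by the (?=(...))\1 idiom.

```python
import re

# Plain greedy a+ can give back one "a", so the trailing "ab" still matches.
plain = re.search(r"a+ab", "aaab")

# Emulated atomic group: the lookahead captures the longest run of "a" and
# the backreference \1 consumes it. The engine cannot backtrack into the
# lookahead, so no "a" is left over for the trailing "ab".
emulated = re.search(r"(?=(a+))\1ab", "aaab")

print(plain.group(0) if plain else None)
print(emulated.group(0) if emulated else None)

# The lookahead-based pattern from the answer, with (?>X) written as (?=(X))\1.
PATH_RE = re.compile(r"^/(?:(?=([^/]+|/(?!/*(?:$|stop(?:/|$)))))\1)+")

def capture_prefix(path):
    """Hypothetical helper: return the matched prefix, or None if no match."""
    m = PATH_RE.match(path)
    return m.group(0) if m else None

for s in ["/abc/foo/bar///", "/abc/foo/bar//stop", "/abc/foo/bar/sto"]:
    print(f"{s} -> {capture_prefix(s)}")
```

The first search succeeds and the second fails, which is the behavioural difference an atomic group is meant to produce.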

Related

Trying to combine two Regex

I'm trying to combine two working regex patterns into one. Please let me know the correct syntax and if this can be better written.
Pattern 1: (?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|
Pattern 2: (?P<path>[^\/]+(?=\-[^\/-]*$))
Sample line:
06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3
The first expression matches the start of the string and the second matches the end, so you can combine them by putting a non-greedy .*? between them, like this:
(?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|.*?(?P<path>[^\/]+(?=\-[^\/-]*$))
This expression works, but it takes 1660 steps to match the string. This is because each .* between the | characters first consumes the rest of the string up to the end and then steps back character by character to find the match.
If you use the non-greedy form here (.*?), the regex engine will initially match an empty string and then move forward character by character until it finds the matching |. This reduces the number of steps to 1183.
However, if you want to remove this backtracking (or forward-tracking) entirely, you can skip as many non-| characters as possible in one go with [^|]*. The other .* patterns in the regex can be replaced similarly. The resulting regex finds a match in just 47 steps, more than 30 times fewer than the original regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*
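To see which fields this regex extracts, here is a minimal Python sketch run against the sample line from the question. The variable names are mine. Step counts are specific to the regex tester, so the sketch only checks the captured groups.

```python
import re

LINE = (
    "06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|"
    "path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3"
)

# The optimized regex, split over two raw strings for readability only.
OPTIMIZED = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*"
)

fields = OPTIMIZED.match(LINE).groupdict()
for name, value in fields.items():
    print(f"{name}: {value}")
```

The path group ends before the last hyphen, so the trailing "-64.mp3" is not part of it.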
Update 2020-03-09
If you want to keep the last slash you can use this regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*
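A quick Python check of this variant on the same sample line (the variable names are mine) shows the leading slash kept in the path group:

```python
import re

LINE = (
    "06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|"
    "path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3"
)

# Variant that keeps the last slash in front of the file name.
KEEP_SLASH = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*"
)

path_with_slash = KEEP_SLASH.match(LINE).group("path")
print(path_with_slash)
```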

How can I match all lines with a certain pattern, except when a certain substring is present?

I have multiple lines containing code that follows a very simple pattern: &G3FRM.GetRecord("<TAG>").GetField("<TAG>").Value. For example, I might have the following:
&G3FRM.GetRecord("PAGEREC").GetField("GSHOURS").Value
&G3FRM.GetRecord("RSCH_SETUP").GetField("Y_NIH_MNTHLY_CAP").Value
&G3FRM.GetRecord("PAYMENT").GetField("Y_HRS_TOTAL").Value
I need to match anything that has &G3FRM.GetRecord, that doesn't have PAGEREC as the first string/tag, and is then followed by the rest of the pattern. These statements can appear at the beginning, middle or end of any given line, and there could even be multiple matches in a single line.
This is the Regex pattern that I have tried:
&G3FRM\.GetRecord\("(?!PAGEREC)"\)\.GetField\("\w+"\)\.Value
As far as I understand, this is matching some literals (&G3FRM.GetRecord(") and is then looking for any string that doesn't match PAGEREC, using a negative lookahead. It certainly excludes any of the matches that have PAGEREC, but it also excludes everything else, so I know that I'm missing something.
So, I have a bunch of lines that I've cherry-picked that could look something like this:
Local string &rqst_dept_descr = %This.GetDepartmentDescription(&G3FRM.GetRecord("PAGEREC").GetField("GSREQUESTING_DEPT").Value);
Local string &hoursHTML = GetHTMLText(HTML.G_FORM_ROW_VALUE, "Hours", &G3FRM.GetRecord("PAYMENT").GetField("GSHOURS").Value);
Local string &off_cycle_deposit = &G3FRM.GetRecord("PAGEREC").GetField("GSOFFCYCLE_DIR_DEP").Value;
&G3FRM.GetRecord("POSITION").GetField("GSCOMMISSIONTIPS").Value = "Y";
SQLExec(SQL.Y_HAS_CONTRACT_DATA_IN_RANGE, &G3FRM.GetRecord("PAGEREC").GetField("EMPLID").Value, &G3FRM.GetRecord("PAYMENT").GetField("CONTRACT_NUM").Value, &G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &HasContractData);
In this example, it should exclude the first line, since it only has the pattern I don't want. It should include the second line, exclude the third, include the fourth, and include the fifth (even though it does have one example of the excluded pattern, it has multiples that I do want).
You may use this regex:
&G3FRM\.GetRecord\("(?!PAGEREC\b)\w+"\)\.GetField\("\w+"\)\.Value
Note the use of \w+ after the negative lookahead, so that it matches a word that must not be PAGEREC. I have added \b to your lookahead condition so that it only rejects the whole word PAGEREC, not longer words that start with it.
In your regex &G3FRM\.GetRecord\("(?!PAGEREC)"\)\.GetField\("\w+"\)\.Value the negative lookahead condition is correct, but nothing in the pattern matches the text between the two double quotes, so your regex will only match e.g. &G3FRM.GetRecord("").GetField("GSHOURS").Value.
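As a rough check, the corrected regex can be run over the five sample lines from the question with Python's re.findall. The variable names and the extra "PAGEREC1" line are assumptions added for illustration.

```python
import re

PATTERN = re.compile(
    r'&G3FRM\.GetRecord\("(?!PAGEREC\b)\w+"\)\.GetField\("\w+"\)\.Value'
)

lines = [
    'Local string &rqst_dept_descr = %This.GetDepartmentDescription(&G3FRM.GetRecord("PAGEREC").GetField("GSREQUESTING_DEPT").Value);',
    'Local string &hoursHTML = GetHTMLText(HTML.G_FORM_ROW_VALUE, "Hours", &G3FRM.GetRecord("PAYMENT").GetField("GSHOURS").Value);',
    'Local string &off_cycle_deposit = &G3FRM.GetRecord("PAGEREC").GetField("GSOFFCYCLE_DIR_DEP").Value;',
    '&G3FRM.GetRecord("POSITION").GetField("GSCOMMISSIONTIPS").Value = "Y";',
    'SQLExec(SQL.Y_HAS_CONTRACT_DATA_IN_RANGE, &G3FRM.GetRecord("PAGEREC").GetField("EMPLID").Value, &G3FRM.GetRecord("PAYMENT").GetField("CONTRACT_NUM").Value, &G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &HasContractData);',
]

# Number of wanted matches on each line.
match_counts = [len(PATTERN.findall(line)) for line in lines]
print(match_counts)

# Hypothetical tag that only starts with PAGEREC: \b keeps it matchable.
longer_tag = PATTERN.findall('&G3FRM.GetRecord("PAGEREC1").GetField("X").Value')
print(longer_tag)
```

Lines one and three produce no matches, while the fifth line yields three even though it also contains one PAGEREC occurrence.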

Regex for selecting words ending in 'ing' unless

I want to select words ending in ing with a regular expression, but I want to exclude words that end in thing. For example:
everything
running
catching
nothing
Of these words, running and catching should be selected, everything and nothing should be excluded.
I've tried the following:
.+ing$
But that matches every word in the list. I'm thinking lookaheads/lookarounds could be the solution, but I haven't been able to get one that works.
Solutions that work in Python or R would be helpful.
In Python you can use a negative lookbehind assertion like this:
^.*(?<!th)ing$
(?<!th) is a negative lookbehind expression that fails the match if th comes before ing at the end of the string.
Note that if the words you are matching are not on separate lines, you should use word boundaries instead of anchors:
\w+(?<!th)ing\b
Something like \b\w+(?<!th)ing\b maybe.
You might also use a negative lookahead (?!...) to assert that what is on the right is not zero or more word characters followed by thing and a word boundary:
\b(?!\w*thing\b)\w*ing\b
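Here is a small Python sketch comparing the anchored lookbehind, the word-boundary lookbehind and the lookahead variant on the four words from the question. The variable names are my own.

```python
import re

words = ["everything", "running", "catching", "nothing"]

# Anchored version: one word per string.
anchored = re.compile(r"^.*(?<!th)ing$")
selected = [w for w in words if anchored.match(w)]

# Word-boundary versions: all the words in one line of text.
text = " ".join(words)
lookbehind_hits = re.findall(r"\b\w+(?<!th)ing\b", text)
lookahead_hits = re.findall(r"\b(?!\w*thing\b)\w*ing\b", text)

print(selected)
print(lookbehind_hits)
print(lookahead_hits)
```

All three approaches select running and catching and skip everything and nothing.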

regex to start matching from last letter in previous match

I have the following regex
(\{\w*\}\s*[^{}]+\s*)\{?
and I am testing it on this string
this {match} is cool{match} but {match} this one is more cool
Currently I am able to capture 2 groups, {match} is cool and {match} this one is more cool, so as you can see the middle group ({match} but) is missing.
The reason is that the last matched character is {, so on the next attempt the engine skips that { and cannot match again until it finds a new {.
Does anyone know how to force it to match the middle group as well?
Debugging: http://regex101.com/r/hM5xE6/2
You can probably just remove the \{? (and the \s* too); you also don't need the capturing parentheses:
\{\w*\}[^{}]+
Test it live on regex101.com.
If you want to enforce the match to end before a { or at the end of the string, you can use a positive lookahead assertion for that:
\{\w*\}[^{}]+(?=\{|$)
But you would only need that if you wanted to avoid a match completely if there are nested braces, like in {{match} whatever}, where the first regex would find {match} whatever.
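A minimal Python sketch of the difference, using the test string from the question and the nested-braces example above (the variable names are mine):

```python
import re

text = "this {match} is cool{match} but {match} this one is more cool"

# Original regex: the optional trailing \{? consumes the next opening brace,
# so the middle group is skipped.
original = re.findall(r"(\{\w*\}\s*[^{}]+\s*)\{?", text)

# Simplified regex: nothing is consumed past the group.
simplified = re.findall(r"\{\w*\}[^{}]+", text)

# With nested braces, the lookahead version refuses to match at all.
nested = "{{match} whatever}"
loose = re.findall(r"\{\w*\}[^{}]+", nested)
strict = re.findall(r"\{\w*\}[^{}]+(?=\{|$)", nested)

print(original)
print(simplified)
print(loose, strict)
```

Note that the middle match keeps its trailing space ("{match} but "), because [^{}]+ runs right up to the next brace.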
You can use the following regular expression to start matching from last letter in previous match
\{\w*\}[^{}]+(?=\{|$)
Your regex consumes the next { as part of the match (though not as part of the capture group), so the next match attempt begins on the character after it, skipping that { and not matching until it gets to the following one.
There's no need for lookaheads or anything like that.
If you remove the trailing check for \{?, you get all 3 matches (you can also remove the } from the character class and the last \s*):
(\{\w*\}\s*[^{]+)
(http://regex101.com/r/hM5xE6/7)
You can also use the following regex, depending on how specific you need to be with the capture:
(\{\w*\}[\w\s]*)
http://regex101.com/r/hM5xE6/5
(\{\w*\}\s*[^{}]+\s*)(?=\{|$)
Try this. Use a lookahead for a zero-width assertion. See the demo.
http://regex101.com/r/qC9cH4/18

Regex to match number in #define statement

I have a line like this:
#define PROG_HWNR "36084"
or this:
#define PROG_HWNR "#37595"
I'd like to extract the number (and increase it, but that's not the matter here)
I wrote a regex, but it's not working (at least in http://gskinner.com/RegExr/ )
(?<="#?)(.*?)(?=")
I also tried variations like
(?<=("#?))(.*?)(?=")
or
(?<=("|"#)))(.*?)(?=")
But no success. The problem is, that I want to match only the number, no matter if there is a # or not ...
Can you point me in the right direction? Thanks!!
Try this regex:
"#?(\d+)"$
It will match:
" a quote
#? optional hash
( (start capturing)
\d+ one or more digits
) (stop capturing)
" a quote
$ anchor to end
Here is a JSFiddle, and here is a RegExr
The problem is the variable length of the lookbehind, which only a few regex engines support. Because there are only two possible lookbehinds (with or without the #), you can expand it into two fixed-length lookbehinds:
(?:(?<="#)|(?<=")).*?(?=")
Note that you don't need to capture the .*? if you use lookarounds, as they are excluded from the match anyway. Be aware, though, that the match starts at the leftmost position that works, which is directly after the quote, so for "#37595" this still returns #37595. A better approach than non-greedy .*? is a greedy character class that can never go past the ending delimiter and that also excludes the #:
(?:(?<="#)|(?<="))[^"#]*(?=")
Alternatively (if you can access captured submatches), you can use a capturing approach and get rid of the lookarounds:
"#?([^"]*)"
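Python's re module is one of the engines that requires fixed-width lookbehinds, so it makes a convenient test bed. The sketch below is only an illustration: the helper name compiles and the digits-only lookbehind variant are my own additions, not part of the answer.

```python
import re

lines = ['#define PROG_HWNR "36084"', '#define PROG_HWNR "#37595"']

def compiles(pattern):
    """Hypothetical helper: report whether re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The variable-length lookbehind from the question is rejected by re.
variable_ok = compiles(r'(?<="#?)(.*?)(?=")')

# Capturing approach: the optional # sits outside the group.
capture = re.compile(r'"#?([^"]*)"')
captured = [capture.search(line).group(1) for line in lines]

# Two fixed-width lookbehinds; matching digits only (my variation) keeps
# the # out of the match.
lookaround = re.compile(r'(?:(?<="#)|(?<="))\d+(?=")')
looked = [lookaround.search(line).group(0) for line in lines]

print(variable_ok)
print(captured)
print(looked)
```

The capturing version is the simpler of the two, and it needs no lookaround support at all.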
Try this:
^#define \w+ "#?(\d+)"$
That will match the whole line, with the first/single group being the number you are looking for.
This is actually pretty basic regex functionality: match an optional character (?) and match a group of characters (the parentheses).
You can even go one simpler:
\d+
will match a string of digits. Only the digits. And ignore the rest of the input string.
Use this tool for testing this stuff, I found it pretty handy: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx