Regex: optimal syntax for optional combined expression? - regex

I want to match a combination of expressions that is optional. In this specific example, I want to match the word through. Also, if the word jump or swim precedes through (separated by whitespace), then the whole phrase should match. So the combination of expressions preceding through must be optional.
I want all the following lines to be positive matches:
swim through <-- match entire phrase
jump through <-- match entire phrase
hike through <-- match only the word "through"
To do this, I can use the following expression:
(jump\W|swim\W)?through
However, is it possible to accomplish the same thing without having to add \W after jump and swim? I was trying something like this:
(jump|swim)?\W?through
But that wasn't working properly because it would include the space that precedes through on the 3rd example. I only want the word through, not the whitespace around it.

What about this one: (?:(jump|swim)\W)?through
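A quick way to check this suggestion is to run it against the three sample lines. The snippet below is a minimal sketch in Python (the question does not name a regex flavor, so Python's re module is an assumption, and matched_text is a hypothetical helper). It also runs the earlier (jump|swim)?\W?through attempt for comparison.

```python
import re

# Suggested pattern: the verb and its trailing non-word character
# form a single optional unit.
suggested = re.compile(r"(?:(jump|swim)\W)?through")
# Earlier attempt from the question: verb and separator are optional
# independently, so the separator can match on its own.
attempt = re.compile(r"(jump|swim)?\W?through")

def matched_text(pattern, line):
    """Return the text of the first match in line, or None if there is none."""
    m = pattern.search(line)
    return m.group(0) if m else None

for line in ["swim through", "jump through", "hike through"]:
    print(repr(matched_text(suggested, line)), repr(matched_text(attempt, line)))
```

For "hike through" the suggested pattern yields only the word through, while the earlier attempt also picks up the preceding space.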

Related

Trying to combine two Regex

I'm trying to combine two working regex patterns into one. Please let me know the correct syntax and if this can be better written.
Pattern 1: (?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|
Pattern 2: (?P<path>[^\/]+(?=\-[^\/-]*$))
Sample line:
06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3
The first expression matches the start of the string and the second matches the end, so you can combine them by putting a non-greedy .*? between them, like this:
(?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|.*?(?P<path>[^\/]+(?=\-[^\/-]*$))
As you can see here this expression works, but it takes 1660 steps to match the string. This is because each .* between the | characters first captures the whole string up to the end and then steps back character by character to find the match.
If you use the non-greedy modifier here (.*?), the regex engine will initially match an empty string and then move forward character by character until it finds the matching |. This reduces the number of steps to 1183.
However, if you want to remove this backtracking (forward-tracking) altogether, you can very quickly skip as many non-| characters as possible with [^|]*. Similarly, we can replace the other .* patterns in the regex. The resulting regex finds a match in just 47 steps, more than 30 times fewer than the original regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*
Update 2020-03-09
If you want to keep the last slash you can use this regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*
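To see what the two negated-class patterns capture, here is a small sketch that runs them on the sample line with Python's re module (the (?P<name>...) syntax suggests Python, but that is an assumption; the constant names are illustrative). It shows the captured groups only and does not reproduce the step counts quoted above.

```python
import re

SAMPLE = (
    "06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|"
    "path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3"
)

# Pattern that skips fields with [^|]* and drops the leading slash of the path.
COMBINED = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*"
)
# Variant from the update that keeps the last slash in the path group.
WITH_SLASH = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*"
)

fields = COMBINED.search(SAMPLE).groupdict()
fields_with_slash = WITH_SLASH.search(SAMPLE).groupdict()
print(fields)
print(fields_with_slash["path"])
```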

How can I match all lines with a certain pattern, except when a certain substring is present?

I have multiple lines containing code that follows a very simple pattern: &G3FRM.GetRecord("<TAG>").GetField("<TAG>").Value. For example, I might have the following:
&G3FRM.GetRecord("PAGEREC").GetField("GSHOURS").Value
&G3FRM.GetRecord("RSCH_SETUP").GetField("Y_NIH_MNTHLY_CAP").Value
&G3FRM.GetRecord("PAYMENT").GetField("Y_HRS_TOTAL").Value
I need to match anything that has &G3FRM.GetRecord, that doesn't have PAGEREC as the first string/tag, and is then followed by the rest of the pattern. These statements can appear at the beginning, middle or end of any given line, and there could even be multiple matches in a single line.
This is the Regex pattern that I have tried:
&G3FRM\.GetRecord\("(?!PAGEREC)"\)\.GetField\("\w+"\)\.Value
As far as I understand, this is matching some literals (&G3FRM.GetRecord(") and is then looking for any string that doesn't match PAGEREC, using a negative lookahead. It certainly excludes any of the matches that have PAGEREC, but it also excludes everything else, so I know that I'm missing something.
So, I have a bunch of lines that I've cherry-picked that could look something like this:
Local string &rqst_dept_descr = %This.GetDepartmentDescription(&G3FRM.GetRecord("PAGEREC").GetField("GSREQUESTING_DEPT").Value);
Local string &hoursHTML = GetHTMLText(HTML.G_FORM_ROW_VALUE, "Hours", &G3FRM.GetRecord("PAYMENT").GetField("GSHOURS").Value);
Local string &off_cycle_deposit = &G3FRM.GetRecord("PAGEREC").GetField("GSOFFCYCLE_DIR_DEP").Value;
&G3FRM.GetRecord("POSITION").GetField("GSCOMMISSIONTIPS").Value = "Y";
SQLExec(SQL.Y_HAS_CONTRACT_DATA_IN_RANGE, &G3FRM.GetRecord("PAGEREC").GetField("EMPLID").Value, &G3FRM.GetRecord("PAYMENT").GetField("CONTRACT_NUM").Value, &G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &HasContractData);
In this example, it should exclude the first line, since it only has the pattern I don't want. It should include the second line, exclude the third, include the fourth, and include the fifth (even though it does have one example of the excluded pattern, it has multiples that I do want).
You may use this regex:
&G3FRM\.GetRecord\("(?!PAGEREC\b)\w+"\)\.GetField\("\w+"\)\.Value
Note the use of \w+ after the negative lookahead, which lets it match a word that must not be PAGEREC. I have added \b to your lookahead condition so that it only rejects the whole word PAGEREC and not longer tags that merely start with it.
In your regex &G3FRM\.GetRecord\("(?!PAGEREC)"\)\.GetField\("\w+"\)\.Value the negative lookahead condition is correct, but nothing in the pattern matches the text between the two double quotes, so your regex will only match something like &G3FRM.GetRecord("").GetField("GSHOURS").Value.
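The behaviour of the lookahead can be checked on a few of the sample lines. This is a minimal sketch using Python's re module (an assumption, since the question does not say which engine is used); the PAGEREC2 line is a hypothetical input added only to show the effect of \b.

```python
import re

PATTERN = re.compile(
    r'&G3FRM\.GetRecord\("(?!PAGEREC\b)\w+"\)\.GetField\("\w+"\)\.Value'
)

lines = [
    # Only the excluded tag: no match expected.
    'Local string &off_cycle_deposit = &G3FRM.GetRecord("PAGEREC").GetField("GSOFFCYCLE_DIR_DEP").Value;',
    # A wanted tag at the start of the line.
    '&G3FRM.GetRecord("POSITION").GetField("GSCOMMISSIONTIPS").Value = "Y";',
    # One excluded occurrence followed by three wanted ones on the same line.
    'SQLExec(SQL.Y_HAS_CONTRACT_DATA_IN_RANGE, &G3FRM.GetRecord("PAGEREC").GetField("EMPLID").Value, '
    '&G3FRM.GetRecord("PAYMENT").GetField("CONTRACT_NUM").Value, '
    '&G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, '
    '&G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &HasContractData);',
    # Hypothetical line (not from the question): \b keeps PAGEREC2 matchable.
    '&G3FRM.GetRecord("PAGEREC2").GetField("EMPLID").Value',
]

# findall returns every non-overlapping match in each line.
matches = [PATTERN.findall(line) for line in lines]
for found in matches:
    print(found)
```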

Regex match multiple elements

I have regex that I am trying to match to specific function parameters. I want to be able to style them a certain way in a language package.
Here is the text I am trying to match:
addFill(path:svgjs.Element, pattern:Pattern, docMaxSide:number) {
pathFillId(path)
}
In this example, I want to match the words "path" "pattern" and "docMaxSide" from the parameters. I want to make sure it does NOT match the word "path" in the second line (where I am calling pathFillId).
Here is my current regex: \(.*?(\w+):.*?\)
Broken down:
\( Find open parens
.*? It may have stuff before it, but after the parens
(\w+): Capture a word before a colon
.*? There may be more stuff after the colon
\) Close parens
Right now, it will only match the first item, "path". But I need it to match all the words I mentioned above.
UPDATE: I should have been more specific. It should only match if it's a function parameter. For example, I don't want path1 matched in the following: var path1:string. The difficulty is coming up with regex that matches items only between parens.
Try this:
\w+(?=:)
with the g modifier (the global modifier finds all matches instead of stopping at the first one)
UPDATE
If you only want to match the parameters inside the parentheses, you can do this:
\w+(?=:[\w.]+\s*[,)])
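Both lookahead patterns can be tried on the text from the question. The following is a sketch in Python, where re.findall plays the role of the g modifier; the choice of Python and the constant names are assumptions made for illustration.

```python
import re

SOURCE = """addFill(path:svgjs.Element, pattern:Pattern, docMaxSide:number) {
pathFillId(path)
}"""

# First suggestion: any word directly followed by a colon.
ANY_TYPED_NAME = re.compile(r"\w+(?=:)")
# Stricter suggestion: the type after the colon must be followed by , or ).
PARAMETER_NAME = re.compile(r"\w+(?=:[\w.]+\s*[,)])")

print(PARAMETER_NAME.findall(SOURCE))
print(ANY_TYPED_NAME.findall("var path1:string"))
print(PARAMETER_NAME.findall("var path1:string"))
```

The stricter pattern skips var path1:string because nothing after the type is a comma or a closing parenthesis. It is still a heuristic rather than a real check that the name sits between parentheses.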
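Your problem is this part of your regex: .*?. You specify that you want any character (.), and that's correct. Note, though, that ? directly after * does not mean {0,1}; it makes the * ({0,}) lazy, so it matches as few characters as possible. The construct is valid, but the pattern as a whole consumes everything from ( to ) in a single match, so its one capture group can only return one parameter.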
If that doesn't help, you might test your regex with regexe.com or similar.

Don't know how to use lookarounds properly to achieve my Regex match

I'm writing a Perl script and part of it requires that I match all occurrences of a certain pattern in a string. Naturally, a regular expression seems like it would be powerful enough, but I just can't get it right for this particular string.
A hypothetical example of the type of text the regex might be applied to would be:
1cat;2dog;!3monkey;!4horse;
As you can see, several data entries (1cat, 2dog, etc.) are present in the line, delimited by semicolons. The beginning of the line contains no semicolon, but the end does. I want to be able to match all the stuff which hasn't been not'ed by the !. In the above example, 1cat and 2dog would be matched and returned in list context, while 3monkey and 4horse would not.
What I have tried to do so far is use negative lookbehinds to notice only the entries without a !. Something like this:
m/(?<!\!)(\w+)\;/g
However, this doesn't work because for every !'ed entry, the regex just matches what comes after the first character, up to the semicolon. In the example, 1cat and 2dog are captured, but so are monkey and horse.
I feel like this is easily doable, but I'm new to regular expressions and I can't think of anything else.
Throw a word boundary (\b) in there and you should be good:
(?<!!)\b(\w+);
As you could tell, your negative lookbehind was working, but it would still match everything after the next character (horse from !4horse). A word boundary is a zero-width assertion, kind of like a conditional that doesn't consume anything (like the anchors ^ and $). It asserts this: (^\w|\w\W|\W\w|\w$). In other words, it matches anywhere a word character ([a-zA-Z0-9_]) is next to the beginning/end of the string or a non-word character.
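The difference the word boundary makes can be seen by running both patterns on the sample string. The question is about Perl; the sketch below uses Python's re module instead (an assumption made for illustration), where the lookbehind and \b behave the same way for this input.

```python
import re

TEXT = "1cat;2dog;!3monkey;!4horse;"

# Original attempt: the lookbehind only rejects the position right after "!",
# so the match can still start one character later.
without_boundary = re.findall(r"(?<!\!)(\w+)\;", TEXT)
# With \b the match must start at the beginning of an entry.
with_boundary = re.findall(r"(?<!!)\b(\w+);", TEXT)

print(without_boundary)
print(with_boundary)
```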

Find lines with same characters set

I have a situation like this.
Car Driver
Cat Mouse
Door House
Driver Car
I need help with a regex to find all lines with the same set of characters or words, no matter how they are placed in the line.
Car Driver
Driver Car
Edited list:
A0JLS3 Q9NUA2 <
A0JLT2 Q9Y3C7
A0N0L5 P26441
A0N0Q1 O00626
A0N0Q1 P35626
A0PJF8 P27361
Q9NUA2 A0JLS3 <
EDIT: after taking a look at your file, it seems that there is one tab character after the first word and a variable number of tab characters after the second, so you must change the pattern to:
^(\w+)\h+(\w+)\h*$(?=(?>\R.*)*?\R(?:\1\h+\2|\2\h+\1)\h*$)
where \h stands for a horizontal whitespace character.
Since you seem to have huge files and I don't see how to avoid a reluctant quantifier in the lookahead assertion, you can try this modified pattern, where all the quantifiers are possessive (when possible) and all groups are atomic. It seems to be a little faster:
^(\w++)\h++(\w++)\h*+$(?=(?>\R.*+)*?\R(?>\1\h++\2|\2\h++\1)\h*+$)
Previous answer:
You can use this pattern:
^(\w+) (\w+)$(?=(?>\R.*)*?\R(?:\1 \2|\2 \1)$)
This will find lines that have a "duplicate line" with the same two words later in the text. If you want to use it to remove duplicates, keep in mind that this will preserve the last occurrence and remove the first.
pattern details:
^(\w+) (\w+)$ : this describes a whole line (note the anchors for start ^ and end $ of the line) and put each word in a capturing group (group 1 and group 2)
The second part of the pattern checks if there is a "similar line" (a line with the same words) after. Since it is embedded in a lookahead assertion ((?=...) i.e. followed by), this part isn't included in the match result.
(?>\R.*)*?: lines until the duplicate. \R stands for a line break such as CRLF or LF, and .* matches all characters except newlines. The group is repeated with a lazy quantifier to stop before the first duplicate line. (Note that this works with a greedy quantifier too; the best choice depends on what your document looks like. For example, if duplicates are often at the end of the document, using a greedy quantifier is a better choice.)
(?:\1 \2|\2 \1) describes the two possibilities using backreferences to group 1 and 2.
$ is added to ensure that the last word is whole. (otherwise something like A0N0L5 P26441 ... A0N0L5 P26441XXX will succeed)
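The pattern described above can be tried outside the editor with a small script. The sketch below is an adaptation for Python's re module, not the original pattern: Python has no \R, so \n is used instead, the atomic group (?>...) is replaced by a plain non-capturing group for compatibility with older Python versions, and the MULTILINE flag makes ^ and $ match at line boundaries. The function name is hypothetical.

```python
import re

# Adapted pattern: \R -> \n, (?>...) -> (?:...), single space between words.
DUPLICATE_LATER = re.compile(
    r"^(\w+) (\w+)$(?=(?:\n.*)*?\n(?:\1 \2|\2 \1)$)",
    re.MULTILINE,
)

def lines_with_later_duplicate(text):
    """Return lines whose two words reappear, in either order, on a later line."""
    return [m.group(0) for m in DUPLICATE_LATER.finditer(text)]

words = "Car Driver\nCat Mouse\nDoor House\nDriver Car"
ids = "A0JLS3 Q9NUA2\nA0JLT2 Q9Y3C7\nA0N0L5 P26441\nQ9NUA2 A0JLS3"

print(lines_with_later_duplicate(words))
print(lines_with_later_duplicate(ids))
```

Only the earlier line of each pair is reported, which is consistent with the note that removing the matches preserves the last occurrence.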
I'm not sure exactly what you are trying to achieve. If you're looking for all lines containing both of the words Car and Driver, you can mark all lines containing this regular expression:
Car Driver|Driver Car
Here's a guide on regular expressions in Notepad++: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions
And consider taking a look at the Stack Overflow Regular Expressions FAQ for some more useful information.