Regex: How to find the second instance of a string - regex

I want to get the name between "for" and ";", which is NISHER HOSE. Can you help me find the correct regular expression? There is more than one "for" and ";" in the string:
Data Owner Approval Needed for Access Request #: 2137352 for NISHER HOSE; CONTRACTOR; Manager: MUILLER, TIM (TWM0069)
Using the regular expression (?<=for).*(?=;) I get the wrong match: Access Request #: 2137352 for NISHER HOSE; CONTRACTOR (tested on https://www.regextester.com/).
Thanks

If you only want to assert for on the left, you should make sure not to match for again, and you should exclude matching a ; while asserting that there is one on the right.
(?<=\bfor )(?:(?!\bfor\b)[^;])+(?=;)
Explanation
(?<=\bfor ) Assert for at the left
(?:(?!\bfor\b)[^;])+ Match 1+ times any char except ;, provided the current position is not directly followed by for surrounded by word boundaries
(?=;) Assert ; directly at the right
Regex demo
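To check this outside an online tester, here is a minimal Python sketch. Python's re module is only my choice for illustration (the question does not name an engine), and the variable names are made up. It runs the original attempt and the pattern above against the sample string.

```python
import re

subject = (
    "Data Owner Approval Needed for Access Request #: 2137352 "
    "for NISHER HOSE; CONTRACTOR; Manager: MUILLER, TIM (TWM0069)"
)

# Original attempt: the greedy .* runs from the first "for" to the last ";".
greedy = re.search(r"(?<=for).*(?=;)", subject)

# Pattern from the answer: every consumed character must not be ";"
# and must not be the start of the whole word "for".
tempered = re.search(r"(?<=\bfor )(?:(?!\bfor\b)[^;])+(?=;)", subject)

print(repr(greedy.group()))
print(repr(tempered.group()))  # 'NISHER HOSE'
```

The first print shows the too-long match from the question (with a leading space, since the lookbehind there does not include it).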

Use
(?<=\bfor )(?![^;]*\bfor\b)[^;]+
See proof.
Explanation
--------------------------------------------------------------------------------
(?<=        look behind to see if there is:
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w)
            and something that is not a word char
--------------------------------------------------------------------------------
for         'for '
--------------------------------------------------------------------------------
)           end of look-behind
--------------------------------------------------------------------------------
(?!         look ahead to see if there is not:
--------------------------------------------------------------------------------
[^;]*       any character except: ';' (0 or more
            times (matching the most amount
            possible))
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w)
            and something that is not a word char
--------------------------------------------------------------------------------
for         'for'
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w)
            and something that is not a word char
--------------------------------------------------------------------------------
)           end of look-ahead
--------------------------------------------------------------------------------
[^;]+       any character except: ';' (1 or more times
            (matching the most amount possible))

The main issue here is that there are two "for". If you want to catch the name then use the ":" as a delimiter to catch the second "for":
Regex: /:.*for(.+?);/gm
Demo: https://regex101.com/r/p3QY0o/1
The name will be captured in group 1. If you decide to use a lookahead/lookbehind just bear in mind that these may or may not be supported depending on the regex engine.
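For engines without lookaround support, the capturing-group pattern can be tried like this. This is a rough Python sketch under my own assumptions: the /gm flags are dropped because the sample is a single line, and the variable names are hypothetical. Note that group 1 starts with the space after "for", so it is stripped here.

```python
import re

subject = (
    "Data Owner Approval Needed for Access Request #: 2137352 "
    "for NISHER HOSE; CONTRACTOR; Manager: MUILLER, TIM (TWM0069)"
)

# ":" anchors the search after the first colon, the greedy .* then backs off
# to the last "for" that is still followed by some text and a ";".
match = re.search(r":.*for(.+?);", subject)

raw_name = match.group(1)   # includes the leading space
name = raw_name.strip()
print(repr(raw_name), repr(name))
```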

Related

Regex starting with certain set of characters

I have a requirement where the regex has to match only a certain set of characters.
For example, the requirement is that the string can start with
JIRA-<5 digit number> or PROJ-<5 digit number>
This means allowed values look like:
JIRA-12345
PROJ-98765
I tried regex as
(\JIRA-[0-9]+)|(\ PROJ-[0-9]+)
This does not seem to work. Please suggest how to proceed.
Thanks
Use
\b(?:JIRA|PROJ)-\d{5}\b
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w) and
            something that is not a word char
--------------------------------------------------------------------------------
(?:         group, but do not capture:
--------------------------------------------------------------------------------
JIRA        'JIRA'
--------------------------------------------------------------------------------
|           OR
--------------------------------------------------------------------------------
PROJ        'PROJ'
--------------------------------------------------------------------------------
)           end of grouping
--------------------------------------------------------------------------------
-           '-'
--------------------------------------------------------------------------------
\d{5}       digits (0-9) (5 times)
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w) and
            something that is not a word char
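A quick way to sanity-check the pattern is to run it over a few accepted and rejected values. The sketch below uses Python purely for illustration; the sample values other than the two from the question are made up.

```python
import re

# Pattern from the answer: JIRA or PROJ, a hyphen, exactly five digits.
TICKET = re.compile(r"\b(?:JIRA|PROJ)-\d{5}\b")

samples = ["JIRA-12345", "PROJ-98765", "JIRA-1234", "PROJ-123456", "TEST-12345"]

# Map each sample to whether the pattern finds a ticket id in it.
results = {s: bool(TICKET.search(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample}: {ok}")
```

The trailing \b is what rejects PROJ-123456: after five digits the next character is another digit, so there is no word boundary there.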

How to exclude brackets at the end of the Url

I am new to regex, so any help is really appreciated.
I have an expression to identify a URL :
(http[^'\"]+)
Unfortunately, on some URLs I get additional square brackets at the end.
For instance: "http://example.com]]"
As a result, I want to receive "http://example.com"
How do I get rid of those brackets with the help of the regex I wrote above?
What you actually have is called a negated character class, so just add characters that should not be matched. In addition, there's not really a need for a capturing group. That said, you could use
http[^'"\]\[]+
#       ^^^^
Note that this will exclude square brackets anywhere in the URL, not just at the end. See a demo on regex101.com.
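The caveat about brackets anywhere in the URL is easy to see with two inputs. This is only a Python sketch; the second URL is a made-up example with a bracket in the middle.

```python
import re

# Negated character class from the answer: stop at quotes or square brackets.
URL = re.compile(r"""http[^'"\]\[]+""")

trailing = URL.search('"http://example.com]]"').group()
inner = URL.search('"http://example.com/a[1]/b"').group()

print(trailing)  # http://example.com
print(inner)     # http://example.com/a  (cut at the first bracket)
```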
Stop the match between a word and nonword character:
(http[^'"]+)\b
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(           group and capture to \1:
--------------------------------------------------------------------------------
http        'http'
--------------------------------------------------------------------------------
[^'"]+      any character except: ''', '"' (1 or
            more times (matching the most amount
            possible))
--------------------------------------------------------------------------------
)           end of \1
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w) and
            something that is not a word char
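One thing worth knowing about the word-boundary variant: it trims every trailing non-word character, not only brackets. A small Python sketch (the second URL, with a trailing slash, is my own made-up input):

```python
import re

# Word-boundary pattern from the answer: the match must end right after a word char.
URL = re.compile(r"""(http[^'"]+)\b""")

bracketed = URL.search('"http://example.com]]"').group(1)
slashed = URL.search('"http://example.com/path/"').group(1)

print(bracketed)  # http://example.com
print(slashed)    # http://example.com/path  (trailing "/" is dropped too)
```

Whether losing a trailing slash matters depends on what the URLs are used for afterwards.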

PCRE Regex: Exclude last portion of word

I am trying to write a regex expression in PCRE which captures the first part of a word and excludes the second portion. The first portion needs to accommodate different values depending upon where the transaction is initiated from. Here is an example:
Raw Text:
.controller.CustomerDemographicsController
Regex Pattern Attempted:
\.controller\.(?P<Controller>\w+)
Result I am trying to achieve (only the CustomerDemographics part should be saved in the named capture group):
.controller.CustomerDemographicsController
NOTE: I've attempted to exclude using ^, lookbehind, and lookahead.
Any help is greatly appreciated.
You can match word chars in the Controller group up to the last uppercase letter:
\.controller\.(?P<Controller>\w+)(?=\p{Lu})
See the regex demo. Details:
\.controller\. - a .controller. string
(?P<Controller>\w+) - Named capturing group "Controller": one or more word chars as many as possible
(?=\p{Lu}) - the next char must be an uppercase letter.
Note that (?=\p{Lu}) makes the \w+ stop before the last uppercase letter because the \w+ pattern is greedy due to the + quantifier.
Also, use
\.controller\.(?P<Controller>[A-Za-z]+)[A-Z]
See proof.
EXPLANATION:
--------------------------------------------------------------------------------
\.                '.'
--------------------------------------------------------------------------------
controller        'controller'
--------------------------------------------------------------------------------
\.                '.'
--------------------------------------------------------------------------------
(?P<Controller>   group and capture to Controller:
--------------------------------------------------------------------------------
[A-Za-z]+         any character of: 'A' to 'Z', 'a' to 'z'
                  (1 or more times (matching the most
                  amount possible))
--------------------------------------------------------------------------------
)                 end of Controller group
--------------------------------------------------------------------------------
[A-Z]             any character of: 'A' to 'Z'
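Both ideas can be tried quickly in Python, with one caveat: Python's built-in re module does not support \p{Lu}, so in this sketch I substitute [A-Z] in the lookahead variant. That substitution is my assumption and only covers ASCII capitals; in PCRE the original \p{Lu} can stay.

```python
import re

text = ".controller.CustomerDemographicsController"

# Lookahead variant, with [A-Z] standing in for \p{Lu} (ASCII-only assumption).
with_lookahead = re.search(r"\.controller\.(?P<Controller>\w+)(?=[A-Z])", text)

# Variant that consumes the uppercase letter after the group.
with_consume = re.search(r"\.controller\.(?P<Controller>[A-Za-z]+)[A-Z]", text)

# In both cases the greedy quantifier backtracks to the last uppercase letter.
print(with_lookahead.group("Controller"))  # CustomerDemographics
print(with_consume.group("Controller"))    # CustomerDemographics
```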

Match if the line has two or more of the same capitalized word

Basically I want to match this:
So this. So that. [this should match]
Yes this. No that. [this shouldn't match]
I thought this would work:
(\b(\w+)\1\b.*){2,}
But right now, it's matching the second line too: https://regexr.com/5jhag
Why is this and how to fix it?
Match if the line has two or more of the same capitalized word
As you want to match capitalized words only, \w is not right because it matches any of the [a-zA-Z0-9_] characters. Also, using \1 right after the capture group allows consecutive repeats only. Finally, \b is required around both matches.
You may use this regex:
\b([A-Z]\w*)\b.*\b\1\b
RegEx Demo
RegEx Details:
\b: Word boundary
([A-Z]\w*): Match a capitalized word that starts with an uppercase letter followed by 0 or more of any word characters
\b: Word boundary
.*: Match 0 or more of any characters
\b\1\b: Match same word as what we captured in group #1 surrounded with word boundaries
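Here is a small Python sketch that applies this regex to the two lines from the question. The language and the variable names are my own choice for illustration.

```python
import re

# Capture a capitalized word, then require the same word again later in the line.
REPEATED = re.compile(r"\b([A-Z]\w*)\b.*\b\1\b")

lines = ["So this. So that.", "Yes this. No that."]

# True when a capitalized word occurs at least twice in the line.
matches = [bool(REPEATED.search(line)) for line in lines]
repeated_word = REPEATED.search(lines[0]).group(1)

print(matches)        # [True, False]
print(repeated_word)  # So
```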
(\b(\w+)\1\b.*){2,} is a repeated capturing group. \1 is a backreference that references the value of the group it is defined in and it is always assigned an empty string, at each iteration. Note: if you were to test with PCRE engine, there would be no match, see proof, because \1 is not empty, it is null and there is no match.
Your regex matches Yes this. No that. because the current expression is equal to (\b(\w+)\b.*){2,} and matches any word, then any text, two times or more.
Use
.*\b([A-Z][a-zA-Z]+)\b.*\b\1\b.*
See proof.
Unicode version:
.*\b(\p{Lu}\p{L}+)\b.*\b\1\b.*
See another proof.
Explanation
--------------------------------------------------------------------------------
.*          any character except \n (0 or more times
            (matching the most amount possible))
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w) and
            something that is not a word char
--------------------------------------------------------------------------------
(           group and capture to \1:
--------------------------------------------------------------------------------
[A-Z]       any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
[a-zA-Z]+   any character of: 'a' to 'z', 'A' to 'Z'
            (1 or more times (matching the most
            amount possible))
--------------------------------------------------------------------------------
)           end of \1
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w) and
            something that is not a word char
--------------------------------------------------------------------------------
.*          any character except \n (0 or more times
            (matching the most amount possible))
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w) and
            something that is not a word char
--------------------------------------------------------------------------------
\1          what was matched by capture \1
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w) and
            something that is not a word char
--------------------------------------------------------------------------------
.*          any character except \n (0 or more times
            (matching the most amount possible))

Non-Greedy Single Character Match Regex

I'm doing a non-greedy match like this
'(?<C2>.+?)'
to find a group inside a quotes. This works well, until I want to do something like this
'(?<C2>.+?)' as
to match something in quotes followed by a space, followed by the word as.
But now, the following will not match as desired
'hello'123'hello2' as
I want this to not match at all...but it ends up matching the whole chunk
'hello'123'hello2'
as C2
What's the best way to force the non-greedy .+? to include up to the first occurrence of a ', not the first occurrence of ' as?
This seems to work
(?<C2>'[^']+')(?= as)
Explanation
NODE        EXPLANATION
--------------------------------------------------------------------------------
(?<C2>      group and capture to C2:
--------------------------------------------------------------------------------
'           '\''
--------------------------------------------------------------------------------
[^']+       any character except: ''' (1 or more
            times (matching the most amount
            possible))
--------------------------------------------------------------------------------
'           '\''
--------------------------------------------------------------------------------
)           end of C2
--------------------------------------------------------------------------------
(?=         look ahead to see if there is:
--------------------------------------------------------------------------------
 as         ' as'
--------------------------------------------------------------------------------
)           end of look-ahead
Even without the lookahead (?= as), (?<C2>'[^']+') will match quoted strings in a non-greedy way as expected.
You can try:
'(?<C2>[^']+?)' as
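To compare the lazy dot with the negated-class suggestions, here is a rough Python sketch. Python's re module spells named groups as (?P<C2>...) rather than (?<C2>...), so the patterns are adapted accordingly; that adaptation and the variable names are mine.

```python
import re

text = "'hello'123'hello2' as"

# Lazy dot from the question: .+? can still cross quotes to reach the first "' as".
lazy = re.search(r"'(?P<C2>.+?)' as", text)

# Negated class with the quotes outside the group.
negated = re.search(r"'(?P<C2>[^']+?)' as", text)

# Negated class with the quotes inside the group and a lookahead for " as".
quoted = re.search(r"(?P<C2>'[^']+')(?= as)", text)

print(lazy.group("C2"))     # hello'123'hello2
print(negated.group("C2"))  # hello2
print(quoted.group("C2"))   # 'hello2'
```

Note that both negated-class patterns still find a match in this input (the trailing 'hello2'); they only stop the group from swallowing the earlier quotes.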
I think I understood your question differently than those who have replied so far. By
What's the best way to force the non-greedy .+? to include up to the first occurrence of a ', not the first occurrence of ' as
did you mean to say you wanted to match the word between the first two ', i.e. hello, not hello2? In that case, this is my suggestion:
'(?<C2>.+?)'(?! as)
The negative lookahead will ensure that you will not match the word which comes before as.
In case I misunderstood your request: sorry.
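For this second reading of the question, the negative-lookahead pattern behaves as follows. Again this is just a Python sketch, with the named group written as (?P<C2>...) because Python's re module requires that spelling.

```python
import re

text = "'hello'123'hello2' as"

# Match a quoted chunk whose closing quote is NOT followed by " as".
first = re.search(r"'(?P<C2>.+?)'(?! as)", text)

print(first.group("C2"))  # hello
```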