RegEx expression to keep first appearance of word grouping - regex

I have the following RegEx (\[[^]]+])(?=.*\1)
which identifies the first set of appearances of a duplicate word group inside a string (each word group is enclosed between [ ] brackets). However, I am trying to come up with a RegEx that identifies the last set of appearances of a duplicate word group. Reason being, I need to remove duplicate word groups while retaining the order in which each group appears in the overall string.
Using the following string as an example whereby only [John Smith] and [Jane Doe] are duplicate word groups:
[John Smith][John Smith][Mr. Smith][Jane Doe][Mrs. Doe][John Smith][Jane Doe][Doe][John][Smith John][John Smith Sr]
After using my RegEx in a RegEx Replace formula, I get the below:
[Mr. Smith][Mrs. Doe][John Smith][Jane Doe][Doe][John][Smith John][John Smith Sr]
However, I need my RegEx Replace formula to give me:
[John Smith][Mr. Smith][Jane Doe][Mrs. Doe][Doe][John][Smith John][John Smith Sr]
I have tried many ways to achieve the latter with no luck. Thanks in advance.

Considering infinite-width lookbehinds:
(\[[^\][]+])(?<=\1.*\1)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\[ '['
--------------------------------------------------------------------------------
[^\][]+ any character except: '\]', '[' (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
] ']'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
) end of look-behind

Many regex engines do not support variable-length lookbehinds, but most support variable-length lookaheads. When working with an engine that supports variable-length lookaheads, but not variable-length lookbehinds, one approach that often works is to reverse the string, modify the (reversed) string with a regex and then reverse the resulting string. That approach could be used here.
Suppose, for example, the string were
"[John Smith][John Smith][Mr. Smith][Jane Doe][John Smith][Jane Doe][Doe]"
Reversing the string produces
"]eoD[]eoD enaJ[]htimS nhoJ[]eoD enaJ[]htimS .rM[]htimS nhoJ[]htimS nhoJ["
We now convert matches of the following expression to empty strings:
(\][^[]*\[)(?=.*\1)
which produces
"]eoD[]eoD enaJ[]htimS .rM[]htimS nhoJ["
Lastly, we reverse that string to obtain
"[John Smith][Mr. Smith][Jane Doe][Doe]"
The regular expression can be written in free-spacing mode to make it self-documenting:
( # begin capture group 1
\] # match ']'
[^[]* # match zero or more (as many as possible) chars other than '['
\[ # match '['
) # end capture group 1
(?= # begin a positive lookahead
.* # match zero or more (as many as possible) chars
\1 # match the content of capture group 1
) # end the positive lookahead
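For illustration, the reverse, replace, reverse steps might look as follows in Python, whose built-in re module is one of the engines that rejects variable-length lookbehinds. This is only a sketch: the function name dedupe_keep_first is made up for this example, and the '[' inside the character class is escaped for readability, which does not change what the pattern matches.

```python
import re

# Same pattern as above, applied to the REVERSED string: remove a group
# when an identical group appears later in the reversed text (that is,
# earlier in the original text).
REVERSED_DUPLICATE = re.compile(r"(\][^\[]*\[)(?=.*\1)")


def dedupe_keep_first(text):
    """Keep only the first occurrence of each [bracketed] group (hypothetical helper)."""
    reversed_text = text[::-1]                            # step 1: reverse
    cleaned = REVERSED_DUPLICATE.sub("", reversed_text)   # step 2: drop duplicates
    return cleaned[::-1]                                  # step 3: reverse back


example = "[John Smith][John Smith][Mr. Smith][Jane Doe][John Smith][Jane Doe][Doe]"
print(dedupe_keep_first(example))
```

Because '.' does not match a newline by default, this sketch assumes the whole input sits on one line.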
At first glance this may seem a kludge, but since reversing a string is so easy in any language it does provide a useful work-around in some cases. Mind you, doing what you want to do here in code is pretty easy in most languages. In Ruby, for example, you could write (str being a variable holding the string)
str.scan(/\[[^\]]*\]/).uniq.join
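A rough Python counterpart of that Ruby one-liner, offered as a sketch rather than a drop-in solution: re.findall plays the role of scan, and dict.fromkeys stands in for uniq because dictionaries keep insertion order in Python 3.7 and later. The name unique_groups is invented for this example.

```python
import re


def unique_groups(text):
    """Return the [bracketed] groups of text with later duplicates removed."""
    groups = re.findall(r"\[[^\]]*\]", text)   # every [ ... ] group, in order
    # dict.fromkeys keeps the first occurrence of each key and preserves order
    return "".join(dict.fromkeys(groups))


example = (
    "[John Smith][John Smith][Mr. Smith][Jane Doe][Mrs. Doe][John Smith]"
    "[Jane Doe][Doe][John][Smith John][John Smith Sr]"
)
print(unique_groups(example))
```

Note that, like the Ruby version, this discards any text that sits outside the brackets.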

Related

PCRE Regex: Exclude last portion of word

I am trying to write a regex expression in PCRE which captures the first part of a word and excludes the second portion. The first portion needs to accommodate different values depending upon where the transaction is initiated from. Here is an example:
Raw Text:
.controller.CustomerDemographicsController
Regex Pattern Attempted:
\.controller\.(?P<Controller>\w+)
Results trying to achieve (in bold is the only content I want to save in the named capture group):
.controller.CustomerDemographicsController
NOTE: I've attempted to exclude using ^, lookback, and lookforward.
Any help is greatly appreciated.
You can match word chars in the Controller group up to the last uppercase letter:
\.controller\.(?P<Controller>\w+)(?=\p{Lu})
See the regex demo. Details:
\.controller\. - a .controller. string
(?P<Controller>\w+) - Named capturing group "Controller": one or more word chars as many as possible
(?=\p{Lu}) - the next char must be an uppercase letter.
Note that (?=\p{Lu}) makes the \w+ stop before the last uppercase letter because the \w+ pattern is greedy due to the + quantifier.
Alternatively, use
\.controller\.(?P<Controller>[A-Za-z]+)[A-Z]
See proof.
EXPLANATION:
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
controller 'controller'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
(?P<Controller> group and capture to Controller:
--------------------------------------------------------------------------------
[A-Za-z]+ any character of: 'A' to 'Z', 'a' to 'z'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of Controller group
--------------------------------------------------------------------------------
[A-Z] any character of: 'A' to 'Z'
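To see the backtracking at work, here is a small Python sketch of the ASCII-only variant. Python's built-in re module does not support \p{Lu}, so the [A-Za-z]+ / [A-Z] form is used; the variable names are assumptions made for this example.

```python
import re

# Greedy [A-Za-z]+ first swallows the whole word, then backtracks until the
# character right after the group is an uppercase letter.
pattern = re.compile(r"\.controller\.(?P<Controller>[A-Za-z]+)[A-Z]")

raw_text = ".controller.CustomerDemographicsController"
match = pattern.search(raw_text)
controller = match.group("Controller") if match else None
print(controller)
```

The group ends just before the last uppercase letter, so the trailing "Controller" is left out of the capture.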

REGEX match a String and select up to a char

I am trying to create a regular expression where I can match the beginning of the string and then replace anything after a specific character. For example:
String = 123456789:0:0 => Output = 123456789:2:4
I need a regex that matches "123" at the beginning and then replaces only "0:0" with another string.
Matching "123" is easy: ^123, but I cannot find a way after this to go up to : and replace only the rest of the string.
I would appreciate any help.
You can use a negated character class to match up till the first occurrence of a colon.
In the replacement use capture group 1, followed by the replacement.
^(123[^:]*:)0:0$
^ Start of string
(123[^:]*:) Capture group 1: match 123, then 0+ times any char except : using a negated character class, then the : itself
0:0 Match literally
$ End of string
Regex demo
If you want to replace all after matching the first colon, you could use .+
^(123[^:]*:).+
See another regex demo
Without knowing your exact programming context it seems like you're using a version of a regex_replace function. With suitable grouping this is easy to do.
Don't think of what you want to replace. Think about what you want to keep.
^(.*?123.*?:)(0:0)(.*?)$
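And as your replacement string use
$12:4$3
Here $1 is followed by the literal text 2:4. In flavors that would read $12 as a reference to group 12 (PHP's preg_replace, for instance), write ${1}2:4$3 instead.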
For replacing the "0:0", you can use:
Numbers between 123 and 0:0 => /^(123[0-9]+:)0:0/ replace ${1}2:4
OR
Anything between 123 and 0:0 => /^(123.+:)0:0/ replace ${1}2:4
This RegExp creates a group starting with 123 until it reaches 0:0 and later we use the group in the replacement string. This however depends on the programming language you use:
Example in PHP
$result = preg_replace("/^(123[0-9]+:)0:0/", "\${1}2:4", "123456789:0:0");
Example in JavaScript
let result = "123456789:0:0".replace(/^(123[0-9]+:)0:0/, "$12:4");
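Since the replacement syntax depends on the language, here is how the same idea might look in Python. This is a sketch under the assumption that the input is a single line; in Python's re.sub the group reference is written \g<1>, which avoids the "\12" ambiguity when a digit follows the reference. The variable names are invented for the example.

```python
import re

source = "123456789:0:0"

# Keep everything up to and including the first colon (group 1),
# then replace the literal "0:0" that follows.
exact = re.sub(r"^(123[^:]*:)0:0$", r"\g<1>2:4", source)

# Variant: replace whatever follows the first colon, not only "0:0".
anything = re.sub(r"^(123[^:]*:).+", r"\g<1>2:4", "123456789:7:9")

print(exact)
print(anything)
```

A string that does not start with 123 is returned unchanged, because re.sub only rewrites text the pattern matches.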
With JavaScript:
const str = `123456789:0:0`;
console.log(str.replace(/(?<=^123[^:]*:).*/, `2:4`));
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
123 '123'
--------------------------------------------------------------------------------
[^:]* any character except: ':' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))

How to Grep Search two occurrences of a character in a lookbetween

I seem to have to perpetually relearn Regex & Grep syntax every time I need something advanced. This time, even with BBEDIT's pattern playground, I can't work this one out.
I need to do a multi-line search for the occurrence of two literal asterisks anywhere in the text between a pair of tags in a plist/XML file.
I can successfully construct a lookbetween so:
(?s)(?<=<array>).*?(?=</array>)
I try to limit that to only match occurrences in which two asterisks appear between tags:
(?s)(?<=<array>).*?[*]{2}.*?(?=</array>)
(?s)(?<=<array>).+[*]{2}.+(?=</array>)
(?s)(?<=<array>).+?[*]{2}.+?(?=</array>)
But they find nought. And when I remove the {2} I realize I'm not even constructing it right to find occurrences of one asterisk. I tried escaping the character /* and [/*] but to no avail.
How can I match any occurrence of blah blah * blah blah * blah blah ?
[*]{2} means the two asterisks must be consecutive.
(.*[*]){2} is what you're looking for - it contains two asterisks, with anything in between them.
But we also need to make sure the regex is only testing one tag closure at the same time, so instead of .*, we need to use ((?!<\/array>).)* to make sure it won't consume the end tag </array> while matching .*
The regex can be written as:
(?s)(?<=<array>)(?:((?!<\/array>).)*?[*]){2}(?1)*
See the test result here
Use
(?s)(?<=<array>)(?:(?:(?!<\/?array>)[^*])*[*]){2}.*?(?=</array>)
See proof.
Explanation
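NODE EXPLANATION
--------------------------------------------------------------------------------
(?s) set flags for this block (with . matching \n) (case-sensitive) (with ^ and $ matching normally) (matching whitespace and # normally)
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
<array> '<array>'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
(?: group, but do not capture (2 times):
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times (matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
</?array> '</array>' or '<array>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
[^*] any character except: '*'
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
[*] any character of: '*'
--------------------------------------------------------------------------------
){2} end of grouping
--------------------------------------------------------------------------------
.*? any character (0 or more times (matching the least amount possible))
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
</array> '</array>'
--------------------------------------------------------------------------------
) end of look-ahead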
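As a quick check outside BBEdit, the same pattern can be tried in Python, whose re module accepts the fixed-width lookbehind used here. The sample text below is invented for illustration and is not taken from a real plist; only the second array contains two asterisks.

```python
import re

# Two asterisks anywhere between <array> and </array>, without letting the
# match run across another <array> or </array> tag before both are found.
PATTERN = re.compile(
    r"(?s)(?<=<array>)(?:(?:(?!</?array>)[^*])*[*]){2}.*?(?=</array>)"
)

sample = (
    "<array>\n<string>a * b</string>\n</array>\n"
    "<array>\n<string>c * d * e</string>\n</array>\n"
)

matches = PATTERN.findall(sample)
for found in matches:
    print(repr(found))
```

The first array is skipped because the tempered token (?!</?array>) stops the search for the second asterisk at its closing tag.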

How to match a string between two boundaries when only one or none may be present using a regex

I want to match for any string of characters between two words ("Hello" and "Goodbye" in the following examples) using a regex.
The bolded areas in the following list should match:
Hello, I like you. Goodbye.
Hello there, do you enjoy golf?
I like you. Goodbye. See you later.
Examples of strings that should not match at all include (basically I want to treat the words "Hello" and "Goodbye" as a kind of barrier):
HelloGoodbye
Goodbye, how are you?
How are you? Hello
I tried using (?<=Hello).*(?=Goodbye), which works in some cases (see here). The issue with this regex is that if for example "Goodbye" isn't present, none of the text after "Hello" matches (and vice versa).
I'm not exactly sure that the regex I have tried is a good way to go about it. Possibly, I just need to match any part of the string that follows "Hello" and/or precedes "Goodbye" (but neither need to be present for a match).
I believe I need to have some kind of conditional, and I guess matching the first two is easy but I am unable to find a way to do it.
Any help would be appreciated as I am still new to using regular expressions.
Use
(?<=Hello|^)(?:(?!Hello|Goodbye).)+(?=Goodbye|$)
See proof
EXPLANATION
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
Hello 'Hello'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
Hello 'Hello'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
Goodbye 'Goodbye'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)+ end of grouping
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
Goodbye 'Goodbye'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of look-ahead
As an alternative, not quite as sophisticated as the accepted answer and matching differently in case of repetitive boundary words "Hello" and "Goodbye", but maybe a bit easier to understand because it just uses a lazy/reluctant quantifier *? for the match and does not resort to look-behind or look-ahead:
^(?:.*Hello)?(.*?)(?:Goodbye.*)?$
The non-capturing groups starting with (?: make sure that group 1 matches what you need. If you do not mind using group 2, you do not need to use non-capturing groups at all. Keep it simple! Then the regex would read:
^(.*Hello)?(.*?)(Goodbye.*)?$
You can test the first regex here.
See also this regex quantifier cheat sheet.
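A small Python sketch of the simpler, lookaround-free pattern follows. As far as I know, Python's built-in re module rejects (?<=Hello|^) because its lookbehinds must have a fixed width, which is one reason to prefer this form there. The helper name between is made up for the example; an empty string means nothing sits between the barriers.

```python
import re

# Group 1 is the text after an optional leading "...Hello"
# and before an optional trailing "Goodbye...".
PATTERN = re.compile(r"^(?:.*Hello)?(.*?)(?:Goodbye.*)?$")


def between(text):
    """Return the text between the optional 'Hello' and 'Goodbye' barriers."""
    match = PATTERN.match(text)
    return match.group(1) if match else None


for line in [
    "Hello, I like you. Goodbye.",
    "Hello there, do you enjoy golf?",
    "I like you. Goodbye. See you later.",
    "HelloGoodbye",
]:
    print(repr(between(line)))
```

This sketch assumes single-line input, since '.' does not match a newline by default.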

Trying to match what is before /../ but after / with regular expressions

I am trying to match what is before /../ but after / with a regular expressions, but I want it to look back and stop at the first /
I feel like I am close but it just looks at the first slash and then takes everything after it like... input is this:
this/is/a/./path/that/../includes/face/./stuff/../hat
and my regular expression is:
#\/(.*)\.\.\/#
matching /is/a/./path/that/../includes/face/./stuff/../ instead of just that/../ and stuff/../
How should I change my regex to make it work?
.* means "match any number of any character at all[1]". This is not what you want. You want to match any number of non-/ characters, which is written [^/]*.
Any time you are tempted to use .* or .+ in a regex, be very suspicious. Stop and ask yourself whether you really mean "any character at all[1]" or not - most of the time you don't. (And, yes, non-greedy quantifiers can help with this, but character classes are both more efficient for the regex engine to match against and more clear in their communication of your intent to human readers.)
[1] OK, OK... . isn't exactly "any character at all" - it doesn't match newline (\n) by default in most regex flavors - but close enough.
Change your pattern so that only characters other than / ([^/]) get matched:
#([^/]*)/\.\./#
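The difference between the greedy .* and the negated class can be seen side by side in a short Python sketch. The patterns are the ones discussed above with the # delimiters dropped, since Python does not use them; the variable names are assumptions.

```python
import re

path = "this/is/a/./path/that/../includes/face/./stuff/../hat"

# Original pattern: .* is greedy and allowed to cross slashes, so the match
# runs from the first "/" to the last "../" in the string.
greedy = re.search(r"/(.*)\.\./", path).group(0)

# Negated class: [^/]* cannot cross a slash, so only the path segment
# directly in front of "/../" is captured.
segments = re.findall(r"([^/]*)/\.\./", path)

print(greedy)
print(segments)
```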
Alternatively, you can use a lookahead.
#(\w+)(?=/\.\./)#
Explanation
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
) end of look-ahead
I think you're essentially right, you just need to make the match non-greedy, or change the (.*) to not allow slashes: #/([^/]*)/\.\./#
In your favourite language, do a few splits and some string manipulation, e.g. in Python:
>>> s = "this/is/a/./path/that/../includes/face/./stuff/../hat"
>>> a = s.split("/../")[:-1]  # the last item is not required
>>> for item in a:
...     print(item.split("/")[-1])
...
that
stuff
In Python:
>>> import re
>>> test = 'this/is/a/./path/that/../includes/face/./stuff/../hat'
>>> regex = re.compile(r'/\w+?/\.\./')
>>> regex.findall(test)
['/that/../', '/stuff/../']
Or if you just want the text without the slashes:
>>> regex = re.compile(r'/(\w+?)/\.\./')
>>> regex.findall(test)
['that', 'stuff']
([^/]+) will capture all the text between slashes.
([^/]+)*/\.\. matches that/.. and stuff/.. in your string this/is/a/./path/that/../includes/face/./stuff/../hat. It captures that or stuff, and you can change that, obviously, by changing the placement of the capturing parens and your program logic.
You didn't state whether you want to capture or just match. The regex here will capture only the last occurrence of the match (stuff), but it is easily changed to return that and then stuff if used in a global match.
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^/]+ any character except: '/' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'