PCRE Regex: Exclude last portion of word

I am trying to write a regex in PCRE which captures the first part of a word and excludes the last portion. The first portion needs to accommodate different values depending on where the transaction is initiated. Here is an example:
Raw Text:
.controller.CustomerDemographicsController
Regex Pattern Attempted:
\.controller\.(?P<Controller>\w+)
Result I am trying to achieve (only CustomerDemographics, the part before the trailing Controller, should be saved in the named capture group):
.controller.CustomerDemographicsController
NOTE: I've attempted to exclude it using ^, lookbehind, and lookahead.
Any help is greatly appreciated.

You can match word chars in the Controller group up to the last uppercase letter:
\.controller\.(?P<Controller>\w+)(?=\p{Lu})
See the regex demo. Details:
\.controller\. - a .controller. string
(?P<Controller>\w+) - Named capturing group "Controller": one or more word chars as many as possible
(?=\p{Lu}) - the next char must be an uppercase letter.
Note that \w+ is greedy, so it first consumes the whole word and then backtracks until (?=\p{Lu}) succeeds. That is why the group stops right before the last uppercase letter.
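To see the greedy match and the backtracking in action, here is a minimal Python sketch. It is only an approximation of the PCRE pattern: Python's built-in re module has no \p{Lu}, so [A-Z] is used instead (an assumption that the names are ASCII).

```python
import re

# [A-Z] stands in for \p{Lu}, which Python's built-in re module does not support.
pattern = re.compile(r"\.controller\.(?P<Controller>\w+)(?=[A-Z])")

text = ".controller.CustomerDemographicsController"
match = pattern.search(text)

# \w+ first grabs the whole word, then gives characters back until
# the lookahead sees an uppercase letter right after the group.
controller = match.group("Controller") if match else None
print(controller)
```

The lookahead does not consume anything, so the final uppercase letter stays outside both the group and the overall match.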

Also, use
\.controller\.(?P<Controller>[A-Za-z]+)[A-Z]
See proof.
EXPLANATION:
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
controller 'controller'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
(?P<Controller> group and capture to Controller:
--------------------------------------------------------------------------------
[A-Za-z]+ any character of: 'A' to 'Z', 'a' to 'z'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of Controller group
--------------------------------------------------------------------------------
[A-Z] any character of: 'A' to 'Z'
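For comparison, a small Python sketch of this consuming variant (Python's re accepts the same (?P<name>...) syntax). The only difference from a lookahead is that the trailing [A-Z] becomes part of the overall match, while the named group stays the same.

```python
import re

consuming = re.compile(r"\.controller\.(?P<Controller>[A-Za-z]+)[A-Z]")

m = consuming.search(".controller.CustomerDemographicsController")

# The named group stops before the last uppercase letter...
captured = m.group("Controller")
# ...but the overall match also includes that uppercase letter.
whole_match = m.group(0)

print(captured)
print(whole_match)
```

So if only the named group is read, both approaches should give the same value.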

Related

regex if (text contain this text) match this

I have these two sentences:
TAGGING ODP:-7.160792, 113.496069
TAGGING pel:-7.160792, 113.496069
I want to match the -7.160792 part only if the full sentence contains "odp".
I tried the following (?(?=odp)-\d+.\d+) but it doesn't work, and I don't know why.
Any help is appreciated.
(?(?=odp)-\d+\.\d+) won't work because (?=odp) is a positive lookahead that imposes a constraint on the pattern on the right, -\d+\.\d+. Namely, it requires the odp string to occur at exactly the same location where - and a number are expected.
Use either of these:
(?<=ODP:)-\d+\.\d+
ODP:(-\d+\.\d+)
If lookbehinds are supported, the first variant is the more convenient one, since the whole match is the number.
Otherwise, use the second variant and read the value from capturing group 1.
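A quick Python sketch of both variants on the two sample sentences (the helper names are made up for illustration):

```python
import re

lines = [
    "TAGGING ODP:-7.160792, 113.496069",
    "TAGGING pel:-7.160792, 113.496069",
]

lookbehind = re.compile(r"(?<=ODP:)-\d+\.\d+")
capturing = re.compile(r"ODP:(-\d+\.\d+)")

def extract_with_lookbehind(line):
    # The whole match is the number; "ODP:" is only asserted, not consumed.
    m = lookbehind.search(line)
    return m.group(0) if m else None

def extract_with_group(line):
    # "ODP:" is part of the match; the number is read from group 1.
    m = capturing.search(line)
    return m.group(1) if m else None

results = [(extract_with_lookbehind(l), extract_with_group(l)) for l in lines]
print(results)
```

Both helpers return None for the pel line, because neither pattern can find ODP: there.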
And if odp can appear anywhere, even after the number:
(?i)^(?=.*odp).*(-\d+\.\d+)
This will capture the value into a group.
EXPLANATION
--------------------------------------------------------------------------------
(?i) set flags for this block (case-
insensitive) (with ^ and $ matching
normally) (with . not matching \n)
(matching whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
odp 'odp'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \1
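Here is a rough Python sketch of this "odp anywhere" pattern. The third input is a made-up line (not from the question) where odp comes after the number.

```python
import re

anywhere = re.compile(r"(?i)^(?=.*odp).*(-\d+\.\d+)")

def value_if_odp(line):
    # The lookahead only checks that "odp" occurs somewhere in the line;
    # the number itself is read from group 1.
    m = anywhere.search(line)
    return m.group(1) if m else None

print(value_if_odp("TAGGING ODP:-7.160792, 113.496069"))
print(value_if_odp("TAGGING pel:-7.160792, 113.496069"))
print(value_if_odp("TAGGING:-7.160792, 113.496069 (odp)"))  # hypothetical input
```

One thing to keep in mind: because .* is greedy, group 1 holds the last negative number on the line if there are several.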
You can use the regex, (?i)(?<=odp:)[^,]*.
Explanation:
(?i): Case-insensitive flag
(?<=odp:): Positive lookbehind for odp:
[^,]*: Anything but ,
👉 If you want the match to be restricted to numbers only, you can use the regex, (?i)(?<=odp:)(?:-\d+\.\d+)
Explanation:
(?i): Case-insensitive flag
(?<=odp:): Positive lookbehind for odp:
(?:: Start non capturing group
-: Literal, -
\d+: 1+ digit(s)
\.\d+: . followed by 1+ digit(s)
): End non capturing group
👉 If the sign can be either + or -, you can use the regex, (?i)(?<=odp:)(?:[+-]\d+\.\d+)
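The dot between the digit groups should be escaped, otherwise it matches any character. A small Python sketch of the difference, using a deliberately malformed, made-up input:

```python
import re

unescaped = re.compile(r"(?i)(?<=odp:)(?:-\d+.\d+)")   # . matches any character
escaped = re.compile(r"(?i)(?<=odp:)(?:-\d+\.\d+)")    # \. matches a literal dot

malformed = "TAGGING ODP:-7x160792"          # hypothetical bad input
well_formed = "TAGGING ODP:-7.160792, 113.496069"

loose = unescaped.search(malformed)
strict = escaped.search(malformed)
good = escaped.search(well_formed)

print(loose.group(0) if loose else None)
print(strict.group(0) if strict else None)
print(good.group(0) if good else None)
```

On well-formed input both patterns find the same number; the escaped dot only matters when something other than a decimal point could appear there.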
The pattern (?(?=odp)\-\d+\.\d+) uses a conditional (?( whose if clause states:
If what is directly to the right of the current position is odp,
then match -\d+\.\d+
That cannot match, because the text at that position cannot be both odp and a number starting with -.
What you could also do is match odp followed by any chars other than digits using \D*, and capture the number in a group (with the case-insensitive flag enabled, since the sample text has ODP in uppercase).
\bodp\b\D*(-\d+\.\d+)\b
The pattern matches:
\bodp\b match odp between word boundaries to prevent a partial match
\D* Optionally match any char other than a digit
(-\d+\.\d+) Capture - and 1+ digits with a decimal part in group 1
\b A word boundary
Regex demo
(?<=ODP:)(-\d+\.\d+)
You can try using a positive lookbehind.
This should work for the sample text you've provided.

Regex to pick the alias from email address

I need to identify all email addresses in a given cell, enclosed in any special character and spread over any number of lines.
This is something that I built.
"(!\s<,;-)[a-zA-Z0-9]*#"
Is there any improvement?
The pattern (!\s<,;-)[a-zA-Z0-9]*# starts with capturing !\s<,;- literally. If you want to match 1 of the listed characters, you can use a character class [!\s<,;-] instead.
If you want to match xyz123 in xyz123#gmail.com you can use:
[a-zA-Z0-9]+(?=#)
The pattern matches
[a-zA-Z0-9]+ Match 1+ occurrences of any of the listed ranges
(?=#) Assert (not match) an # directly to the right of the current position
See a regex demo.
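A minimal Python sketch of this lookahead pattern over a multi-line cell. The cell content is made up for illustration, and # is used as the separator exactly as it appears in the text above.

```python
import re

alias_pattern = re.compile(r"[a-zA-Z0-9]+(?=#)")

# Hypothetical cell value with two addresses on separate lines.
cell = "<xyz123#gmail.com>;\nabc9#example.org"

# findall returns only the consumed part; the # is asserted, not matched.
aliases = alias_pattern.findall(cell)
print(aliases)
```

Since the lookahead consumes nothing, the separator never ends up in the result, so no capturing group is needed.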
Use
([a-zA-Z0-9]\w*)#
See regex proof
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[a-zA-Z0-9] any character of: 'a' to 'z', 'A' to
'Z', '0' to '9'
--------------------------------------------------------------------------------
\w* word characters (a-z, A-Z, 0-9, _) (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
# '#'

Match if the line has two or more of the same capitalized word

Basically I want to match this:
So this. So that. [this should match]
Yes this. No that. [this shouldn't match]
I thought this would work:
(\b(\w+)\1\b.*){2,}
But right now, it's matching the second line too: https://regexr.com/5jhag
Why is this and how to fix it?
Match if the line has two or more of the same capitalized word
As you want to match capitalized words only, \w+ alone is not right because \w matches [a-zA-Z0-9_], including lowercase letters. Also, placing \1 right after the capture group matches consecutive repeats only. Finally, \b is required around both occurrences.
You may use this regex:
\b([A-Z]\w*)\b.*\b\1\b
RegEx Demo
RegEx Details:
\b: Word boundary
([A-Z]\w*): Match a capitalized word that starts with an uppercase letter followed by 0 or more of any word characters
\b: Word boundary
.*: Match 0 or more of any characters
\b\1\b: Match same word as what we captured in group #1 surrounded with word boundaries
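As a quick check, here is a Python sketch of this pattern against the two lines from the question, plus one extra made-up line to show why the word boundaries matter (the function name is just for illustration):

```python
import re

repeated_cap = re.compile(r"\b([A-Z]\w*)\b.*\b\1\b")

def has_repeated_capitalized_word(line):
    # True when some capitalized word occurs again later as a whole word.
    return repeated_cap.search(line) is not None

print(has_repeated_capitalized_word("So this. So that."))
print(has_repeated_capitalized_word("Yes this. No that."))
print(has_repeated_capitalized_word("So this. Sofa that."))  # hypothetical line
```

Without the \b around \1, the So inside Sofa would count as a repeat.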
(\b(\w+)\1\b.*){2,} is a repeated capturing group. \1 is a backreference to the very group it is defined in, so at each iteration it refers to a group that has not finished capturing. In JavaScript-style engines such a backreference matches an empty string. Note: if you were to test with the PCRE engine, there would be no match, see proof, because there \1 is not empty but unset, and a backreference to an unset group fails.
Your regex matches Yes this. No that. because the current expression is then equivalent to (\b(\w+)\b.*){2,}, which matches any word, then any text, two times or more.
Use
.*\b([A-Z][a-zA-Z]+)\b.*\b\1\b.*
See proof.
Unicode version:
.*\b(\p{Lu}\p{L}+)\b.*\b\1\b.*
See another proof.
Explanation
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[A-Z] any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
[a-zA-Z]+ any character of: 'a' to 'z', 'A' to 'Z'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))

Matching regex after colon but before underscore

I have two strings below to which I need to apply a regex function in Google BigQuery, along with the desired outputs.
Input:
MERCURE ENGAGEMENT_LaL_FB_TALENT:HENRIQUE_PORTUGAL_WEEK 4_IMAGE CAROUSEL_I19
MERCURE ENGAGEMENT_LaL_FB_UGC:_ENGLAND_TBC_WEEK 4_IMAGE CAROUSEL_I25
Output:
HENRIQUE
ENGLAND
I cannot use a lookbehind or lookahead within BigQuery.
The closest I have gotten is the following:
:\D*
This matches everything from the colon up to the first digit, rather than just the word after the colon.
Any ideas would be helpful.
You might also use a capturing group with REGEXP_EXTRACT.
:_?([^\s_]+)
Explanation
:_? Match : and an optional underscore
( Capture group 1
[^\s_]+ Match 1+ times any char other than a whitespace char or an underscore (Omit \s if there can also be spaces in between)
) Close group 1
Regex demo
You could also exclude matching an underscore from a word character which narrows down the range of accepted characters.
:_?([^\W_]+)
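Both patterns can be tried outside BigQuery with a small Python sketch. This is only an approximation: BigQuery uses the RE2 library, while the code below uses Python's re, but neither pattern relies on lookarounds, so the behaviour should be the same here.

```python
import re

inputs = [
    "MERCURE ENGAGEMENT_LaL_FB_TALENT:HENRIQUE_PORTUGAL_WEEK 4_IMAGE CAROUSEL_I19",
    "MERCURE ENGAGEMENT_LaL_FB_UGC:_ENGLAND_TBC_WEEK 4_IMAGE CAROUSEL_I25",
]

no_space_or_underscore = re.compile(r":_?([^\s_]+)")
word_chars_only = re.compile(r":_?([^\W_]+)")

def first_group(pattern, text):
    # Mimics REGEXP_EXTRACT with one capturing group: return group 1 or None.
    m = pattern.search(text)
    return m.group(1) if m else None

outputs_a = [first_group(no_space_or_underscore, s) for s in inputs]
outputs_b = [first_group(word_chars_only, s) for s in inputs]
print(outputs_a)
print(outputs_b)
```

The optional _? is what lets the second input, where the colon is followed by an underscore, yield ENGLAND.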
One approach uses REGEXP_REPLACE:
SELECT REGEXP_REPLACE(col, r'^.*:_?([^_]+)_.*$', r'\1') AS output
FROM yourTable;
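The replace-based approach can be sketched in Python with re.sub. Again this is just an approximation of what REGEXP_REPLACE would do in BigQuery, and the last input is a made-up string without a colon.

```python
import re

replace_pattern = re.compile(r"^.*:_?([^_]+)_.*$")

inputs = [
    "MERCURE ENGAGEMENT_LaL_FB_TALENT:HENRIQUE_PORTUGAL_WEEK 4_IMAGE CAROUSEL_I19",
    "MERCURE ENGAGEMENT_LaL_FB_UGC:_ENGLAND_TBC_WEEK 4_IMAGE CAROUSEL_I25",
]

# The pattern matches the whole string and replaces it with group 1.
outputs = [replace_pattern.sub(r"\1", s) for s in inputs]
print(outputs)

# A string that does not match comes back unchanged, not empty.
unchanged = replace_pattern.sub(r"\1", "NO COLON HERE")  # hypothetical input
print(unchanged)
```

That last point is worth keeping in mind when choosing between replacing and extracting: a non-matching row keeps its original value instead of becoming NULL.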
Use
REGEXP_EXTRACT(column_name, r":[^a-zA-Z]*([a-zA-Z]+)")
See regex proof
Explanation
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
[^a-zA-Z]* any character except: 'a' to 'z', 'A' to
'Z' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[a-zA-Z]+ any character of: 'a' to 'z', 'A' to 'Z'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of \1

Trying to match what is before /../ but after / with regular expressions

I am trying to match what is before /../ but after / with a regular expression, but I want it to look back and stop at the first /
I feel like I am close, but it just finds the first slash and then takes everything after it. The input is this:
this/is/a/./path/that/../includes/face/./stuff/../hat
and my regular expression is:
#\/(.*)\.\.\/#
matching /is/a/./path/that/../includes/face/./stuff/../ instead of just that/../ and stuff/../
How should I change my regex to make it work?
.* means "match any number of any character at all[1]". This is not what you want. You want to match any number of non-/ characters, which is written [^/]*.
Any time you are tempted to use .* or .+ in a regex, be very suspicious. Stop and ask yourself whether you really mean "any character at all[1]" or not - most of the time you don't. (And, yes, non-greedy quantifiers can help with this, but character classes are both more efficient for the regex engine to match against and more clear in their communication of your intent to human readers.)
[1] OK, OK... . isn't exactly "any character at all" - it doesn't match newline (\n) by default in most regex flavors - but close enough.
Change your pattern so that only characters other than / ([^/]) get matched:
#([^/]*)/\.\./#
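To compare the two behaviours side by side, here is a short Python sketch. The # delimiters are dropped, since Python's re takes the bare pattern.

```python
import re

path = "this/is/a/./path/that/../includes/face/./stuff/../hat"

# Greedy .* runs from the first slash to the last "../" in the string.
greedy = re.findall(r"/(.*)\.\./", path)

# [^/]* cannot cross a slash, so each match is a single path segment.
segment_only = re.findall(r"([^/]*)/\.\./", path)

print(greedy)
print(segment_only)
```

The character class gives one result per /../ without needing a non-greedy quantifier.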
Alternatively, you can use a lookahead.
#(\w+)(?=/\.\./)#
Explanation
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
) end of look-ahead
I think you're essentially right, you just need to make the match non-greedy, or change the (.*) to not allow slashes: #/([^/]*)/\.\./#
In your favourite language, do a few splits and some string manipulation, e.g. in Python:
>>> s = "this/is/a/./path/that/../includes/face/./stuff/../hat"
>>> a = s.split("/../")[:-1]  # the last item is not required
>>> for item in a:
...     print(item.split("/")[-1])
...
that
stuff
In Python:
>>> import re
>>> test = 'this/is/a/./path/that/../includes/face/./stuff/../hat'
>>> regex = re.compile(r'/\w+?/\.\./')
>>> regex.findall(test)
['/that/../', '/stuff/../']
Or if you just want the text without the slashes:
>>> regex = re.compile(r'/(\w+?)/\.\./')
>>> regex.findall(test)
['that', 'stuff']
([^/]+) will capture all the text between slashes.
([^/]+)*/\.\. matches that/.. and stuff/.. in your string this/is/a/./path/that/../includes/face/./stuff/../hat. It captures that or stuff, and you can change that, obviously, by changing the placement of the capturing parens and your program logic.
You didn't state whether you want to capture or just match. The regex here will only capture the last occurrence of the match (stuff), but it is easily changed to return that and then stuff if used in a global match.
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^/]+ any character except: '/' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'