Regex for matching groups but excluding a specific combination of groups - regex

I'm trying to match two groups in an expression. Each group represents a single letter in the initials of a name. For example, in George R. R. Martin the first group would match the first R and the second group would match the second R. I have something like this:
\b([a-zA-Z])[\.{0,1} {0,1}]{1,2}([a-zA-Z])\b
However, I'd like to exclude a specific combination of those groups, say when the first group matches the letter d and the second group matches the letter r.
Is that possible?

You may restrict matches with a negative lookahead:
\b(?![dD]\.? ?[rR]\b)([a-zA-Z])\.? ?([a-zA-Z])\b
  ^^^^^^^^^^^^^^^^^^^
Note:
The (?![dD]\.? ?[rR]\b) lookahead is best placed right after the word boundary, so that the check runs only at word boundaries rather than at every position in the string.
The lookahead is negative: it fails the match if the pattern inside it matches the text at that position.
It matches: a d or D with [dD], then an optional literal dot with \.?, an optional space (a space followed by ?), an r or R with [rR] and a trailing word boundary \b.
The main pattern is the more generic \b([a-zA-Z])\.? ?([a-zA-Z])\b, with the lookahead inserted after the leading \b:
\b - leading word boundary
(?![dD]\.? ?[rR]\b) - the negative lookahead
([a-zA-Z]) - Group 1 capturing an ASCII letter
\.? - an optional dot
[ ]? - an optional space (written in the pattern as a bare space followed by ?)
([a-zA-Z]) - Group 2 capturing an ASCII letter
\b - a trailing word boundary
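A minimal sketch of the pattern above in Python, assuming the standard re module. The helper name find_initials and the sample strings are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: two initials, except the pair d/r (as in "Dr").
INITIALS = re.compile(r"\b(?![dD]\.? ?[rR]\b)([a-zA-Z])\.? ?([a-zA-Z])\b")

def find_initials(text):
    """Return a list of (first, second) initial pairs found in text."""
    return [m.groups() for m in INITIALS.finditer(text)]

print(find_initials("George R. R. Martin"))  # [('R', 'R')]
print(find_initials("Dr. Martin"))           # []
print(find_initials("D. R. Horton"))         # []
```

Because the dot and the space are both optional, any standalone two-letter word also matches this pattern, so it may need tightening depending on the input.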

Related

Regex to add underscore between number and unit (or replace whitespace with underscore between number and unit)

I have a long text that contains data like:
23cm,
23m,
60 cm,
60 m,
So sometimes there is a space between number and unit. Sometimes there isn't one.
How to add an underscore in each case, so the result would be:
23_cm,
23_m,
60_cm,
60_m
The search pattern for a part of it is probably (\d) (?:cm|m), but I can't figure out the rest.
We can use capturing groups. The following example uses \2 and \3 for the capturing groups. Some languages would use $2 and $3.
See https://regex101.com/r/KxYyrb/1
input string
23cm, 23m, 60 cm, 60 m,
pattern
((\d+)\s?(m|cm))
replace using
\2_\3
output
23_cm, 23_m, 60_cm, 60_m,
You can use
(\d)\s?(c?m)\b
The replacement pattern is $1_$2.
Details:
(\d) - Capturing group 1: a digit
\s? - an optional whitespace char
(c?m) - Capturing group 2: an optional c and an m
\b - a word boundary (else, the regex will match m in men, for example).
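As a rough illustration, here is the same replacement in Python, assuming the standard re module. Python writes the group references as \1 and \2 rather than $1 and $2. The extra "5 men" in the sample text is added here only to show the effect of the word boundary.

```python
import re

text = "23cm, 23m, 60 cm, 60 m, 5 men"

# Group 1 is the last digit, group 2 is the unit; an optional
# whitespace character between them is dropped by the replacement.
result = re.sub(r"(\d)\s?(c?m)\b", r"\1_\2", text)

print(result)  # 23_cm, 23_m, 60_cm, 60_m, 5 men
```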
I suggest replacing matches of
(?<=\d) ?(?=c?m,)
with an underscore. If a space is present it is matched; else the (zero-width) location between the last digit and 'cm' or 'm' is matched.
The regular expression can be broken down as follows. (I have enclosed the space in a character class to make it visible to the reader.)
(?<= # begin a positive lookbehind
\d # match a digit
) # end positive lookbehind
[ ]? # optionally match a space
(?= # begin a positive lookahead
c?m, # optionally match a 'c' followed by 'm,'
) # end positive lookahead
If the comma is not always present replace (?=c?m,) with (?=c?m\b), \b being a word boundary.
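The lookaround variant can be sketched in Python like this, assuming the standard re module (which accepts this lookbehind because it has a fixed width). The name UNIT_GAP is made up for the example, and the \b form of the lookahead is used so the comma is not required.

```python
import re

text = "23cm, 23m, 60 cm, 60 m,"

# Matches either a single space or the empty position between
# a digit and the unit, so nothing needs to be captured.
UNIT_GAP = re.compile(r"(?<=\d) ?(?=c?m\b)")

result = UNIT_GAP.sub("_", text)
print(result)  # 23_cm, 23_m, 60_cm, 60_m,
```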

Regex match specific strings

I want to capture all of the strings below from multi-line data. Here is the expected result, and here is my pattern, which does not work.
Pattern: ^XYZ/[0-9|ALL|P] I'm lost with this part. Can anyone help?
Result
XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/ALL
XYZ/P1
XYZ/P2,3
XYZ/P4,5-7
XYZ/P1-4,5-7,8-9
Changed to
XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/A12345 (after the slash, limited to 6 alphanumeric chars)
XYZ/LH-1234567890 (after the /LH-, limited to 10 numeric chars)
The pattern could be:
^XYZ\/(?:ALL|P?[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)$
The pattern in parts matches:
^ Start of string
XYZ\/ Match XYZ/ (You don't have to escape the / depending on the pattern delimiters)
(?: Outer non capture group for the alternatives
ALL Match literally
| Or
P? Match an optional P
[0-9]+(?:-[0-9]+)? Match 1+ digits with an optional - and 1+ digits
(?: Non capture group to match as a whole
,[0-9]+(?:-[0-9]+)? Match , and 1+ digits and an optional - and 1+ digits
)* Close the non capture group and optionally repeat it
) Close the outer non capture group
$ End of string
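Here is a small Python sketch of that pattern applied to multi-line data, assuming the standard re module with the MULTILINE flag so that ^ and $ anchor at each line. The last three sample lines are invented here to show rejected input.

```python
import re

XYZ_LINE = re.compile(
    r"^XYZ/(?:ALL|P?[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)$",
    re.MULTILINE,
)

data = """XYZ/1
XYZ/1,2-5
XYZ/ALL
XYZ/P4,5-7
XYZ/P
XYZ/ALL1
ABC/1"""

# The pattern has no capture groups, so findall returns whole lines.
matches = XYZ_LINE.findall(data)
print(matches)  # ['XYZ/1', 'XYZ/1,2-5', 'XYZ/ALL', 'XYZ/P4,5-7']
```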
You can use this regex pattern to match those lines
^XYZ\/(?:P|ALL|[0-9])[0-9,-]*$
Use the global g and multiline m flags.
Btw, [P|ALL] doesn't match the word "ALL".
It only matches a single character that's a P or A or L or |.
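The difference between a character class and an alternation can be checked quickly in Python (a sketch with the standard re module; the variable names are only for illustration):

```python
import re

char_class = re.compile(r"[P|ALL]")     # one character: P, |, A or L
alternation = re.compile(r"(?:P|ALL)")  # the string "P" or the string "ALL"

print(char_class.fullmatch("ALL") is None)       # True: the class never matches three characters
print(alternation.fullmatch("ALL") is not None)  # True
print(char_class.fullmatch("|") is not None)     # True: | is a literal inside [...]
```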

What is the proper regex for capturing everything after "String" and between two delimiters ('=' and a non-alphanumeric character)

Details={
AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123,
IpPermissions=[{FromPort=3306, ToPort=3306, IpProtocol=tcp, IpRanges=[{CidrIp=1.1.1.1/32}, {CidrIp=2.2.2.2/32}, {CidrIp=0.0.0.0/0}, {CidrIp=3.3.3.3/32}],
UserIdGroupPairs=[{UserId=123, GroupId=sg-123abc}]}], IpPermissionsEgress=[{IpProtocol=-1, IpRanges=[{CidrIp=0.0.0.0/0}]}], GroupId=sg-123abc}},
Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]
}
I want to capture exactly arn:aws:ec2:us-east-1:123:security-group/sg-123abc in this example. Generically, I want to capture the value of Id regardless of placement. My current solution is /Details={.*Id=(.*\w)/, but this only works if it's the last object in the data. How can I take into account the following potential scenario:
Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]
Your pattern uses .* twice. The first .* matches to the end of the line or string (depending on whether the dot matches a newline) and then backtracks to the last occurrence where Id=(.*\w) can match. The second .* is greedy as well, so the capture runs on to the last word character and includes anything that follows the value, such as , Thing=123abc.
If you want to use a capture group, you can make the format and the allowed characters a bit more specific:
\bId=(\w+(?:[:\/-]\w+)+)
The pattern in parts
\b A word boundary to prevent a partial word match
Id= Match literally
( Capture group 1
\w+ Match 1+ word chars
(?:[:\/-]\w+)+ Repeat 1+ times either : / - and 1+ word chars
) Close group 1
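As a sketch, this is how the capture group could be read in Python, assuming the standard re module. The details string is a shortened, made-up version of the data in the question, with Id followed by another key.

```python
import re

details = (
    "Details={AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, "
    "GroupId=sg-123abc}}, Region=us-east-1, "
    "Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]}"
)

# \b keeps OwnerId= and GroupId= from matching; / needs no escaping in Python.
ID_VALUE = re.compile(r"\bId=(\w+(?:[:/-]\w+)+)")

m = ID_VALUE.search(details)
resource_id = m.group(1) if m else None
print(resource_id)  # arn:aws:ec2:us-east-1:123:security-group/sg-123abc
```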
Or if you know that it starts with Id=arn:
\bId=(arn:[\w:\/-]+)
Note that you only have to escape the / when the delimiters of the regex are forward slashes, but there is no language tagged.
You can use look-behind to check that there is the Id= prefix, and then match anything that is not a space, comma or closing brace:
(?<=\bId=)[^,}\s]*
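A short Python sketch of the look-behind version, assuming the standard re module (the look-behind is fixed-width, so re accepts it). The helper name extract_id and the sample strings are illustrative only.

```python
import re

ID_LOOKBEHIND = re.compile(r"(?<=\bId=)[^,}\s]*")

def extract_id(text):
    """Return the value that follows a standalone Id= key, or None."""
    m = ID_LOOKBEHIND.search(text)
    return m.group(0) if m else None

last = "Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]"
middle = "Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]"

print(extract_id(last))    # arn:aws:ec2:us-east-1:123:security-group/sg-123abc
print(extract_id(middle))  # arn:aws:ec2:us-east-1:123:security-group/sg-123abc
```

Here the whole match is the value, so no capture group is needed.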

Negating duplicate words pattern

I am new to regex and have the following pattern that detects duplicate words separated with dashes
\b(\w+)-+\1\b
// matches: hey-hey
// does not match: hey-hei
What I really need is a negated version of this pattern.
I've tried negative lookahead, but no good.
(?!\b(\w+)-+\1\b)
You can use
\b(\w+)-+(?!\1\b)\w+
Details:
\b - a word boundary
(\w+) - Group 1: one or more word chars
-+ - one or more hyphens
(?!\1\b)\w+ - one or more word chars that are not equal to the first capturing group value.
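A quick check of this pattern in Python, assuming the standard re module. The name NOT_DUPLICATE and the third sample are made up for the example.

```python
import re

# Two dash-separated words where the second differs from the first.
NOT_DUPLICATE = re.compile(r"\b(\w+)-+(?!\1\b)\w+")

samples = ["hey-hey", "hey-hei", "hey--heyy"]
results = {s: NOT_DUPLICATE.search(s) is not None for s in samples}

print(results)  # {'hey-hey': False, 'hey-hei': True, 'hey--heyy': True}
```

The \b inside the lookahead is what lets hey--heyy through: the second word only starts with the first one, it is not equal to it.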

Working with regex for alphanumeric

I'm trying to write a regex for an alphanumeric string of length 7 (with positions 1,3,4 as letters and positions 2,5,6,7 as digits).
[a-zA-Z]|[0-9]|[a-zA-Z]|[a-zA-Z]|[0-9]|[0-9]|[0-9]
Can someone help me?
The sequence "character, digit, character, character, digit, digit, digit" is expressed in regex as
[a-zA-Z][0-9][a-zA-Z]{2}[0-9]{3}
If you're working in PCRE (with say, PHP):
^([a-zA-Z])([0-9])(?1){2}(?2){3}$
Breakdown:
^ - from the start of the string
([a-zA-Z]) - match and capture a single character in the ranges given: a-z, A-Z
([0-9]) - match and capture a single character in the ranges given: 0-9
(?1){2} - reuse the pattern of the first group twice (a subroutine call)
(?2){3} - reuse the pattern of the second group 3 times (a subroutine call)
$ - the end of the string
If you want to match this in the middle of a sentence, exchange ^ and $ for \b - which will match a word boundary
If you're not using PCRE:
^[a-zA-Z][0-9][a-zA-Z]{2}[0-9]{3}$
Which does the same thing, but has some copy-paste involved
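For example, in Python the non-PCRE form can be used as below (a sketch with the standard re module, which does not support the (?1) syntax). The helper name is_valid_code and the sample values are made up; fullmatch takes the place of the ^ and $ anchors.

```python
import re

# letter, digit, two letters, three digits
CODE = re.compile(r"[a-zA-Z][0-9][a-zA-Z]{2}[0-9]{3}")

def is_valid_code(value):
    """Return True if the whole string fits the 7-character layout."""
    return CODE.fullmatch(value) is not None

print(is_valid_code("A1BC234"))   # True
print(is_valid_code("A1B2345"))   # False: position 4 must be a letter
print(is_valid_code("A1BC2345"))  # False: too long
```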