Regex group re-structuring - regex

I have the following input data:
AB
AB_Test
AB_Test123
Expected output:
AB
AB_Test
I have regex which matches:
([A-Z]{2})(_([a-zA-Z]*))?
So from above regex, I get the strings in $1 & $3. I want to modify the regex such that the strings will be in $1 & $2 (omit group 2 from above regex).
Now I want to process the matched strings using the group. That is why I want it to be in sequence.
Is there a way to change the above regex to do this?

If you are trying to eliminate the strings that have digits, use a negative lookahead assertion:
^([A-Z]{2})(?!.*\d.*)(?:_([A-Za-z]*))?
Or add anchors on both ends of the string:
^([A-Z]{2})(?:_([A-Za-z]*))?$
If you use anchors, you will need to use the m flag if you have a multiline target.
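As a rough illustration of the anchored pattern on a multiline target, here is a minimal Python sketch. The variable names are assumptions made for this example, and `re.MULTILINE` is Python's counterpart of the m flag.

```python
import re

# Sample input from the question, one candidate per line.
text = "AB\nAB_Test\nAB_Test123"

anchored = r"^([A-Z]{2})(?:_([A-Za-z]*))?$"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
matched_lines = [m.group(0) for m in re.finditer(anchored, text, re.MULTILINE)]
print(matched_lines)  # ['AB', 'AB_Test']

# Without the flag, the anchors only apply to the whole string, so nothing matches.
without_flag = re.search(anchored, text)
print(without_flag)  # None
```

The line `AB_Test123` is rejected because `$` cannot match before the trailing digits.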

You need to use a non-capturing group for the one you don't want:
^([A-Z]{2})(?:_([a-zA-Z]*))?$
EDIT: I also added beginning and ending anchors because it seems that you want to match the line only if the characters after the underscore are all letters.
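To see how the group numbering changes, here is a small Python sketch comparing the original pattern with the non-capturing version. The variable names are made up for this example.

```python
import re

# Original pattern: the optional part is itself a capturing group (group 2).
original = re.compile(r"([A-Z]{2})(_([a-zA-Z]*))?")

# Same pattern with (?:...) around the optional part, plus the anchors.
fixed = re.compile(r"^([A-Z]{2})(?:_([a-zA-Z]*))?$")

original_groups = original.match("AB_Test").groups()
fixed_groups = fixed.match("AB_Test").groups()
print(original_groups)  # ('AB', '_Test', 'Test')
print(fixed_groups)     # ('AB', 'Test')

# When the optional part is absent, group 2 is simply unset.
bare_groups = fixed.match("AB").groups()
print(bare_groups)  # ('AB', None)

# The anchors reject input with trailing digits.
rejected = fixed.match("AB_Test123")
print(rejected)  # None
```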

Related

regex to negate from matched group

I am trying to use regex to match anything but "id":digits part
I have come up with this "(\b(id":)(\d+)\b)" to find the id:byDigits pattern, but I need to negate that but haven't been able to get around it.
[{"age":1,"id":123,"value":"14"},
{"age":1,"id":4214,"value":"4324"},
{"age":3,"id":4244,"value":"545"}]
Any help is appreciated.
The simplest option is to capture the rest of the string into groups and use them in the substitution, as below:
Demo: https://regex101.com/r/cRVA5C/2/
Pattern: ^([\s\S]*?)\s*"id":\d+,?\s*([\s\S]*?)$
Breakdown:
([\s\S]*?): match any number of any characters before and after "id":. Capture it into groups \1 and \2
\s*"id":\d+,?\s*: match "id":\d+, optionally preceded by spaces and optionally followed by a comma and spaces.
In the substitution, use \1\2 to get the desired output.
Note: Regex may not be the ideal tool for parsing JSON.
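Here is a hedged Python sketch of that substitution on the sample data. It assumes the pattern is applied per line with `re.MULTILINE` (the flags used in the linked demo are not shown here). The `json`-based variant at the end is an alternative added for comparison, not part of the answer above.

```python
import json
import re

text = (
    '[{"age":1,"id":123,"value":"14"},\n'
    '{"age":1,"id":4214,"value":"4324"},\n'
    '{"age":3,"id":4244,"value":"545"}]'
)

# Assumption: re.MULTILINE, so that ^ and $ delimit each line
# and one "id" entry is removed per line.
pattern = re.compile(r'^([\s\S]*?)\s*"id":\d+,?\s*([\s\S]*?)$', re.MULTILINE)

# Keep only what was captured before and after the "id" entry.
stripped = pattern.sub(r"\1\2", text)
print(stripped)

# Alternative without regex: parse the JSON and drop the key from each object.
without_id = [
    {key: value for key, value in item.items() if key != "id"}
    for item in json.loads(text)
]
print(json.loads(stripped) == without_id)  # True
```

Note that the regex version relies on "id" not being the last key in an object; otherwise a trailing comma would be left behind.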

Regex extract string between 2 strings, that contains 3rd string

I have this regex
(?<=TG00).*?(?=#)
which extracts all strings between TG00 and #. Demo: https://regex101.com/r/04oqua/1
Now, from above results I want to extract only the string which contains TG40 155963. How can I do it?
Try this pattern:
TG00[^#]*TG40 155963[^#]*#
This pattern just says to find the string TG40 155963 in between TG00 and an ending #. For the sample data in your demo there were 3 matches.
For some reason, appending .*? to your lookbehind results in an engine error, but it works fine with a lookahead. The regex below does not match your text exactly, but it does extract it via a capture group.
(?<=TG00).*?(TG40 155963)(?=.*?#)
You can use this regex with a lookahead and negated character class:
(?<=TG00)(?=[^#]*TG40 155963)[^#]+(?=#)
RegEx Explanation:
(?<=TG00): Assert that we have TG00 at previous position
(?=[^#]*TG40 155963): Lookahead to assert we have string TG40 155963 after 0 or more non-# characters, ahead
[^#]+: Match 1+ non-# characters
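The difference between the lookaround pattern and the simpler consuming pattern can be sketched in Python as follows. The sample text is hypothetical, since the data from the question's demo is not reproduced here.

```python
import re

# Hypothetical sample: three TG00...# segments, two of which contain the target.
text = "TG00alpha TG40 155963 beta#TG00gamma TG40 999999#TG00TG40 155963 delta#"

# Lookaround version: only the text between TG00 and # is returned.
lookaround = re.compile(r"(?<=TG00)(?=[^#]*TG40 155963)[^#]+(?=#)")
inner = lookaround.findall(text)
print(inner)  # ['alpha TG40 155963 beta', 'TG40 155963 delta']

# Consuming version: the delimiters TG00 and # are part of each match.
consuming = re.compile(r"TG00[^#]*TG40 155963[^#]*#")
whole = consuming.findall(text)
print(whole)  # ['TG00alpha TG40 155963 beta#', 'TG00TG40 155963 delta#']
```

In both patterns the negated class `[^#]` keeps a match from running past the closing `#` of its own segment.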

Negative lookahead with capturing groups

I'm attempting this challenge:
https://regex.alf.nu/4
I want to match all strings that don't contain an ABBA pattern.
Match:
aesthophysiology
amphimictical
baruria
calomorphic
Don't Match
anallagmatic
bassarisk
chorioallantois
coccomyces
abba
Firstly, I have a regex to determine the ABBA pattern.
(\w)(\w)\2\1
Next I want to match strings that don't contain that pattern:
^((?!(\w)(\w)\2\1).)*$
However this matches everything.
If I simplify this by specifying a literal for the negative lookahead:
^((?!agm).)*$
Then the regex does not match the string "anallagmatic", which is the desired behaviour.
So it looks like the issue is with me using capturing groups and back-references within the negative lookahead.
^(?!.*(.)(.)\2\1).+$
    ^^
You can use a lookahead here. See the demo. The lookahead you created was correct, but you need to add .* so that the pattern cannot appear anywhere in the string.
https://regex101.com/r/vV1wW6/39
Your approach will also work if you make the first group non capturing.
^(?:(?!(\w)(\w)\2\1).)*$
  ^^
See the demo. It was not working because \2 and \1 referred to different groups than you intended. In your regex they should have been \3 and \2.
https://regex101.com/r/vV1wW6/40
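Both corrected patterns can be checked against the word lists from the question with a short Python sketch. The helper name `accepted` is made up for this example.

```python
import re

should_match = ["aesthophysiology", "amphimictical", "baruria", "calomorphic"]
should_not_match = ["anallagmatic", "bassarisk", "chorioallantois", "coccomyces", "abba"]

# Version 1: a single negative lookahead at the start scans the whole string.
lookahead_first = re.compile(r"^(?!.*(.)(.)\2\1).+$")

# Version 2: the lookahead is tested before every character; the outer group
# is non-capturing, so \1 and \2 refer to the two (\w) groups.
per_character = re.compile(r"^(?:(?!(\w)(\w)\2\1).)*$")

def accepted(pattern, words):
    """Return the words that the pattern matches."""
    return [word for word in words if pattern.match(word)]

for pattern in (lookahead_first, per_character):
    print(accepted(pattern, should_match + should_not_match))
```

Each pattern should print only the four words from the "Match" list.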

Regex Greediness

I have a Perl regex that I'm fairly certain should work, but it is being too greedy:
regex:
(?:.*serial[^\d]+?(\d+).*)
Test string:
APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou
Desired group 1 match:
123456
Actual group 1 Match:
12
I've tried every permutation of lookahead and behind and laziness and I can't get the damn thing to work.
WHAT AM I MISSING.
Thanks!
The Problem is Not Greediness, but Case-Sensitivity
Currently your regex matches the 12 at the end of serialnun12, probably because it is case-sensitive. We have two options: using upper-case, or making the pattern case-insensitive.
Option 1: Use Upper-Case
If you only want 123456, you can use:
SERIALNO\K\d+
The \K tells the engine to drop what was matched so far from the final match it returns.
If you want to match the whole string and capture 123456 to Group 1, use:
.*?SERIAL\D+(\d+).*
Option 2: Turning Case-Insensitivity On using (?i) inline or the i flag
To only match 123456, you can use:
(?i)serial\D+\K\d+
Note that if you use the g flag, this would match both numbers.
If you want to match the whole string and capture 123456 to Group 1, use:
(?i).*?serial\D+(\d+).*
A few tips
You can turn on case-insensitivity either with the (?i) inline modifier or with the i flag at the end of the pattern: /serial\D+\K\d+/i
Instead of [^\d], use \D
There is no need for a lazy quantifier in something like \D+\d+ because the two tokens are mutually exclusive: there is no danger that the \D will run over the \d
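The effect of case-sensitivity on the test string can be reproduced with a short Python sketch. Python's built-in `re` module does not support `\K`, so this sketch uses the capture-group form instead. The variable names are illustrative.

```python
import re

s = "APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou"

# Case-sensitive: only the lowercase "serial" near the end can match.
case_sensitive = re.search(r"serial\D+(\d+)", s).group(1)
print(case_sensitive)  # 12

# Option 1: spell the keyword in upper case.
upper_case = re.search(r"SERIAL\D+(\d+)", s).group(1)
print(upper_case)  # 123456

# Option 2: keep the lowercase keyword and ignore case.
case_insensitive = re.search(r"serial\D+(\d+)", s, re.IGNORECASE).group(1)
print(case_insensitive)  # 123456
```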
The problem is not greediness; it's case-sensitivity.
Currently your regex matches the 12 at the end of serialnun12 because those are the only digits following serial. The ones you want follow SERIAL. S and s are different characters.
There are two solutions.
Use the uppercase characters in the pattern.
my ($serial) = $string =~ /SERIAL\D*(\d+)/;
Use case-insensitive matching.
my ($serial) = $string =~ /serial\D*(\d+)/i;
There's probably no need for this, but I thought I'd mention it just in case.

Matching on repeated substrings in a regex

Is it possible for a regex to match based on other parts of the same regex?
For example, how would I match lines that begins and end with the same sequence of 3 characters, regardless of what the characters are?
Matches:
abcabc
xyz abc xyz
Doesn't Match:
abc123
Undefined: (Can match or not, whichever is easiest)
ababa
a
Ideally, I'd like something in the perl regex flavor. If that's not possible, I'd be interested to know if there are any flavors that can do it.
Use capture groups and backreferences.
/^(.{3}).*\1$/
The \1 refers back to whatever is matched by the contents of the first capture group (the contents of the ()). Regexes in most languages allow something like this.
You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
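As a quick check of the tag-matching pattern, here is a minimal Python sketch. The sample HTML strings are made up for illustration.

```python
import re

# \1 must repeat whatever tag name the first group captured.
tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")

# Matching open and close tags: the pattern matches and group 1 is the tag name.
matching = tag_pair.search('<B CLASS="x">bold</B>')
print(matching.group(1))  # B

heading = tag_pair.search("<H1>Title</H1>")
print(heading.group(1))  # H1

# Mismatched tags: </I> does not repeat the captured name B, so there is no match.
mismatched = tag_pair.search("<B>bold</I>")
print(mismatched)  # None
```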
Applying this to your case:
/^(.{3}).*\1$/
(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)
A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):
^ matches the start of the line.
(.{3}) grabs three characters of any type and saves them in a group for later reference.
.* matches anything for as long as possible. (You don't care what's in the middle of the line.)
\1 matches the group that was captured in step 2.
$ matches the end of the line.
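The steps above can be tried against the inputs from the question with a small Python sketch. The dictionary name `results` is an assumption for this example.

```python
import re

# Capture the first three characters, then require them again at the end.
same_ends = re.compile(r"^(.{3}).*\1$")

samples = ["abcabc", "xyz abc xyz", "abc123", "ababa", "a"]
results = {sample: bool(same_ends.match(sample)) for sample in samples}

for sample, matched in results.items():
    print(repr(sample), matched)
```

With this pattern the two "undefined" inputs, `ababa` and `a`, do not match, because the captured group and the backreference cannot overlap and so the line needs at least six characters.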
For the same characters at the beginning and end:
/^(.{3}).*\1$/
This is a backreference.
This works:
my $test = 'abcabc';
print $test =~ m/^([a-z]{3}).*(\1)$/;
For matching the beginning and the end you should add ^ and $ anchors.