Regex Greediness

I have a Perl regex that I'm fairly certain should work, but it is being too greedy:
Regex:
(?:.*serial[^\d]+?(\d+).*)
Test string:
APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou
Desired group 1 match:
123456
Actual group 1 Match:
12
I've tried every permutation of lookahead and behind and laziness and I can't get the damn thing to work.
WHAT AM I MISSING.
Thanks!

The Problem is Not Greediness, but Case-Sensitivity
Currently your regex matches the 12 at the end of serialnun12, probably because it is case-sensitive. We have two options: using upper-case, or making the pattern case-insensitive.
Option 1: Use Upper-Case
If you only want 123456, you can use:
SERIALNO\K\d+
The \K tells the engine to drop what was matched so far from the final match it returns.
If you want to match the whole string and capture 123456 to Group 1, use:
.*?SERIAL\D+(\d+).*
Option 2: Turning Case-Insensitivity On using (?i) inline or the i flag
To only match 123456, you can use:
(?i)serial\D+\K\d+
Note that if you use the g flag, this would match both numbers.
If you want to match the whole string and capture 123456 to Group 1, use:
(?i).*?serial\D+(\d+).*
A few tips
You can turn on case-insensitivity either with the (?i) inline modifier or with the i flag at the end of the pattern: /serial\D+\K\d+/i
Instead of [^\d], use \D
There is no need for a lazy quantifier in something like \D+\d+ because the two tokens are mutually exclusive: there is no danger that the \D will run over the \d
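For anyone who wants to check this outside Perl, here is a minimal Python sketch of the same idea. It assumes the standard re module, which has no \K, so a capture group is used instead; the variable names are only illustrative.

```python
import re

# Test string from the question.
text = "APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou"

# Case-sensitive: only the lowercase "serial" near the end can match.
sensitive = re.search(r"serial\D+(\d+)", text).group(1)

# Case-insensitive: the first hit is now the uppercase SERIAL.
insensitive = re.search(r"serial\D+(\d+)", text, re.IGNORECASE).group(1)

# Rough equivalent of Perl's g flag: every occurrence is returned.
all_numbers = re.findall(r"serial\D+(\d+)", text, re.IGNORECASE)

# Whole-string variant with the inline (?i) modifier and a lazy prefix.
whole = re.fullmatch(r"(?i).*?serial\D+(\d+).*", text).group(1)

print(sensitive, insensitive, all_numbers, whole)
```

The case-sensitive search yields 12, while both case-insensitive variants yield 123456, and findall returns both numbers.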

The problem is not greediness; it's case-sensitivity.
Currently your regex matches the 12 at the end of serialnun12 because those are the only digits following serial. The ones you want follow SERIAL. S and s are different characters.
There are two solutions.
Use the uppercase characters in the pattern.
my ($serial) = $string =~ /SERIAL\D*(\d+)/;
Use case-insensitive matching.
my ($serial) = $string =~ /serial\D*(\d+)/i;
There's probably no need for this, but I thought I'd mention it just in case.

Related

How to match with regexp any occurrence of a specific char within a string delimited by specific delimiters? [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).

location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
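A quick way to see the difference between the three variants is to run them side by side. This is a small Python sketch (standard re module assumed) on the tag from the question; the variable names are made up for illustration.

```python
import re

tag = '<xxxx location="file path/level1/level2" xxxx some="xxx">'

# Greedy: .* runs to the end of the line and backtracks to the LAST quote.
greedy = re.search(r'location="(.*)"', tag).group(1)

# Lazy: .*? stops at the first quote that lets the pattern succeed.
lazy = re.search(r'location="(.*?)"', tag).group(1)

# Negated class: [^"]* can never cross a quote in the first place.
negated = re.search(r'location="([^"]*)"', tag).group(1)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

The lazy and negated-class versions both capture only the path; the greedy one also swallows everything up to the quote that closes some="xxx".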

How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.

Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/

Using the lazy quantifier ? without the global flag is the answer.
If you had the global flag /g, it would match every shortest possible match instead of only the first one.

Here's another way.
Here's the one you want. This is lazy: [\s\S]*?
The first item:
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy: [\s\S]*
The last item:
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
There's only one difference between these two regular expressions, and that is the ?
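To illustrate the first-versus-last behaviour, here is a hedged Python sketch using re.sub as a stand-in for the "Replace with: $1" step. The input string with two location attributes is invented for the example, and the capture group is written out in both patterns.

```python
import re

# Hypothetical input containing two location attributes.
text = '<a location="one"> <b location="two">'

# Lazy prefix: the match anchors on the FIRST location attribute.
first = re.sub(r'[\s\S]*?location="([^"]*)"[\s\S]*', r"\1", text)

# Greedy prefix: the engine backtracks from the end, so the LAST one wins.
last = re.sub(r'[\s\S]*location="([^"]*)"[\s\S]*', r"\1", text)

print(first, last)
```

Under these assumptions the lazy pattern leaves one and the greedy pattern leaves two.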

The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+? etc.) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e. specify a character class which excludes the starting and ending delimiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.

Because you are using a quantified subpattern and, as described in the Perl documentation,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/

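import regex  # third-party module (pip install regex)
text = 'ask her to call Mary back when she comes back'
# (?i) ignores case, (?s) lets . match newlines; .*? stops at the first "back"
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, text):
    # group 1 is ' Mary ' with the surrounding spaces, so strip them
    print(match.group(1).strip())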
Output:
Mary

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.

The syntax used for the regex is not correct to get either sh0rt or t3rm.
You flipped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
\b([a-zA-Z0-9]+)-
\b is a word boundary to prevent a partial word match; the value is in capture group 1.
\b[a-zA-Z0-9]+(?=-)
Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-).
To get t3rm in sh0rt-t3rm you might use, for example, one of:
-([a-zA-Z0-9]+)\b
The other way around, with a leading -; the value is in capture group 1.
-\K[a-zA-Z0-9]+\b
Match - and use \K to drop what was matched so far from the result. Then match the allowed chars in the character class one or more times.
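The same four ideas can be tried in Python as a rough sketch. Python's standard re module has no \K, so a lookbehind (?<=-) stands in for it here; that substitution and the variable names are my own and not part of the patterns above.

```python
import re

word = "sh0rt-t3rm"

# First part: capture group, then positive lookahead.
first_capture = re.search(r"\b([a-zA-Z0-9]+)-", word).group(1)
first_lookahead = re.search(r"\b[a-zA-Z0-9]+(?=-)", word).group(0)

# Second part: capture group, then a lookbehind in place of Perl's \K.
second_capture = re.search(r"-([a-zA-Z0-9]+)\b", word).group(1)
second_lookbehind = re.search(r"(?<=-)[a-zA-Z0-9]+\b", word).group(0)

print(first_capture, first_lookahead, second_capture, second_lookbehind)
```

The first two give sh0rt and the last two give t3rm.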

If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side so that the regex runs in list context, because that's when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
There are of course many other variations on what the data may look like, other than just being one hyphenated word. In particular, if it's part of a text, you may want to include word boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier, and it would need careful inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre.
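As a sanity check on how the greedy and lazy variants behave, here is a small Python sketch (standard re module assumed). The sample strings a-b-c and the short sentence are invented for illustration.

```python
import re

multi = "a-b-c"

# Bare version: everything after the FIRST hyphen.
after_first = re.search(r"-(.+)", multi).group(1)

# Greedy .* pushes the match to the LAST hyphen.
after_last = re.search(r".*-(.+)", multi).group(1)

sentence = "some sh0rt-t3rm text"

# Lazy quantifiers with word boundaries pick out the hyphenated word.
bounded = re.search(r"\b.*?-(.+?)\b", sentence).group(1)

# Word characters only, no explicit boundaries needed.
word_chars = re.search(r"\w*?-(\w+)", sentence).group(1)

print(after_first, after_last, bounded, word_chars)
```

Here after_first is b-c, after_last is c, and both sentence variants capture t3rm.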

-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.

Regex group re-structuring

I have the following input data:
AB
AB_Test
AB_Test123
Expected output:
AB
AB_Test
I have regex which matches:
([A-Z]{2})(_([a-zA-Z]*))?
With the above regex, I get the strings in $1 and $3. I want to modify the regex so that the strings are in $1 and $2 (dropping group 2 of the above regex).
I want to process the matched strings using the groups, which is why I want them to be in sequence.
Is there a way to change the above regex?

If you are trying to eliminate the strings that have digits, use a negative lookahead assertion:
^([A-Z]{2})(?!.*\d.*)(?:_([A-Za-z]*))?
Or add anchors on both ends of the string:
^([A-Z]{2})(?:_([A-Za-z]*))?$
If you use anchors, you will need to use the m flag if you have a multiline target.

You need to use a non-capturing group for the one you don't want:
^([A-Z]{2})(?:_([a-zA-Z]*))?$
EDIT: I also added beginning and ending anchors because it seems that you want to match the line only if the characters after the underscore are all letters.
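To see how the group numbering changes, here is a minimal Python sketch (standard re module assumed) that runs the question's pattern and the anchored, non-capturing version against the sample inputs. The helper name groups_or_none is hypothetical.

```python
import re

original = re.compile(r"([A-Z]{2})(_([a-zA-Z]*))?")
fixed = re.compile(r"^([A-Z]{2})(?:_([a-zA-Z]*))?$")


def groups_or_none(pattern, s):
    """Return the tuple of groups, or None when the pattern does not match."""
    m = pattern.match(s)
    return m.groups() if m else None


# Three groups with the original pattern, two with the non-capturing one.
original_groups = groups_or_none(original, "AB_Test")
results = {s: groups_or_none(fixed, s) for s in ["AB", "AB_Test", "AB_Test123"]}

print(original_groups)
print(results)
```

With the anchors in place, AB_Test123 is rejected outright, and the letters after the underscore land in group 2 rather than group 3.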

Regex Matching with Space

I have a very simple question about regex matching. I want "string" (ignoring case) to be matched as follows:
in this case: "thisisastring", nothing should be returned
in this case: "this is a string", a single match on "string" should be returned
Right now I have #"([S|s][T|t][R|r][I|i][N|n][G|g])" as the regex. However, it doesn't work correctly in the first case.
How should I write this regex?
Thanks in advance!

[S|s] does not match what you seem to think
Please note that [S|s] does not mean "match an S or an s". It means "match one character that is either an S, a | or an s". That's how things work inside a [character class]. To express an OR, you can use a non-capturing group: (?:S|s). But [Ss] is all you need, and case-insensitivity is even better.
Case-Insensitivity
I'm going to assume we're using case-insensitive mode so we end up with a simpler regex. I assume you're in C# as it looks like you're using a verbatim string: (?i) will work. Another way to set case-insensitivity in C# would be RegexOptions.IgnoreCase
Option 1: boundary (close but no cigar)
(?i)\bstring
This no longer matches string in astring. However, it matches string in ##string, which you do not want.
Option 2: lookbehind
(?i)(?<=[ ])string
The lookbehind ensures that string is preceded by a space character. The brackets are optional, they help see the space.
Option 3: \K (but not in C#)
For engines that support it (Perl, PCRE, Ruby 2+):
(?i)[ ]\Kstring
The \K tells the engine to drop what was matched so far from the final match it returns
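The points above can be checked with a short Python sketch. It assumes the standard re module, which supports the lookbehind of Option 2 but not \K; the test strings are taken from the question, plus ##string from Option 1.

```python
import re

# [S|s] is a character class, so it also accepts a literal pipe.
pipe_matches = re.fullmatch(r"[S|s]", "|") is not None

# Option 1: the word boundary still fires after non-word characters.
boundary_hits = re.findall(r"(?i)\bstring", "##string")

# Option 2: the lookbehind insists on a preceding space.
lookbehind = re.compile(r"(?i)(?<=[ ])string")
spaced = lookbehind.findall("this is a string")
glued = lookbehind.findall("thisisastring")
hashed = lookbehind.findall("##string")

print(pipe_matches, boundary_hits, spaced, glued, hashed)
```

Only the space-separated input produces a match with the lookbehind version.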

Regular expression to match phone number?

I want to match a phone number that can have letters and an optional hyphen:
This is valid: 333-WELL
This is also valid: 4URGENT
In other words, there can be at most one hyphen but if there is no hyphen, there can be at most seven 0-9 or A-Z characters.
I don't know how to do an "if statement" in a regex. Is that even possible?

I think this should do it:
/^[a-zA-Z0-9]{3}-?[a-zA-Z0-9]{4}$/
It matches 3 letters or numbers followed by an optional hyphen followed by 4 letters or numbers. This one works in ruby. Depending on the regex engine you're using you may need to alter it slightly.

You seek the alternation operator, indicated with pipe character: |
However, you may need either 7 alternatives (1 for each hyphen location + 1 for no hyphen), or you may require the hyphen between 3rd and 4th character and use 2 alternatives.
One use of alternation operator defines two alternatives, as in:
([0-9A-Za-z]{3}-[0-9A-Za-z]{4}|[0-9A-Za-z]{7})

Not sure if this counts, but I'd break it into two regexes:
#!/usr/bin/perl
use strict;
use warnings;
my $text = '333-URGE';
print "Format OK\n" if $text =~ m/^[\dA-Z]{1,6}-?[\dA-Z]{1,6}$/;
print "Length OK\n" if $text =~ m/^(?:[\dA-Z]{7}|[\dA-Z-]{8})$/;
This should avoid accepting multiple dashes, dashes in the wrong place, etc...

Supposing that you want to allow the hyphen to be anywhere, lookarounds will be of use to you. Something like this:
^([A-Z0-9]{7}|(?=^[^-]+-[^-]+$)[A-Z0-9-]{8})$
There are two main parts to this pattern: [A-Z0-9]{7} to match a hyphen-free string and (?=^[^-]+-[^-]+$)[A-Z0-9-]{8} to match a hyphenated string.
The (?=^[^-]+-[^-]+$) will match for any string with a SINGLE hyphen in it (and the hyphen isn't the first or last character), then the [A-Z0-9-]{8} part will count the characters and make sure they are all valid.
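Here is a rough Python sketch for trying the lookahead pattern against a few inputs (standard re module assumed). The helper is_valid and the invalid sample numbers are made up for the test.

```python
import re

# Seven plain characters, or eight characters containing exactly one
# hyphen that is neither first nor last (enforced by the lookahead).
PHONE = re.compile(r"^([A-Z0-9]{7}|(?=^[^-]+-[^-]+$)[A-Z0-9-]{8})$")


def is_valid(number):
    """Return True when the number fits the pattern above."""
    return PHONE.match(number) is not None


samples = ["333-WELL", "4URGENT", "33--WELL", "-333WELL", "333-WELLS"]
checks = {n: is_valid(n) for n in samples}

print(checks)
```

The two examples from the question pass, while the doubled hyphen, the leading hyphen and the over-long input are rejected.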

Thank you Heath Hunnicutt for his alternation operator answer as well as showing me an example.
Based on his advice, here's my answer:
[A-Z0-9]{7}|[A-Z0-9][A-Z0-9-]{7}
Note: I tested my regex here. (Just including this for reference)
