Why does this regular expression match as few characters as possible? - regex

I can match an 'a' followed by at least 2 other characters before another 'a' with the following regular expression.
a.{2,}?a
Interestingly, including the question mark makes the regex match the instance with the fewest number of middle characters possible, so for instance, given the following string,
abbabbbba
the regex will match the leftmost abba instead of the whole string. Why does including the question mark cause the regex to match the instance with the fewest number of middle characters?

The question mark after a quantifier makes the quantifier lazy. This is a basic regex feature, and it is worth reading up on.
a link: regular-expressions.info
(?:or|and) the one in hwnd comment.

? implies a lazy match.
Here are the details of your regex:
/a.{2,}?a/
a matches the character a literally (case sensitive)
. matches any character (except newline)
{2,} Quantifier: Between 2 and unlimited times
? as few times as possible, expanding as needed [lazy]
a matches the character a literally (case sensitive)
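The difference is easy to see by running the lazy and greedy forms against the string from the question. This is a minimal sketch in Python; the question does not name a language, so the use of the re module here is only an assumption for illustration.

```python
import re

text = "abbabbbba"

# Lazy: .{2,}? takes the minimum two characters first and grows only if needed.
lazy = re.search(r"a.{2,}?a", text).group()

# Greedy: .{2,} takes as much as it can first, then gives characters back.
greedy = re.search(r"a.{2,}a", text).group()

print(lazy)    # abba
print(greedy)  # abbabbbba
```

Both searches start at the leftmost a; only the amount consumed by the middle part differs.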

Related

RegEx: don't capture match, but capture after match

There are a thousand regular expression questions on SO, so I apologize if this is already covered. I did look first.
I have string:
Name Subname 11X22 88X620 AB33(20) YA5619 77,66
I need to capture this string: YA5619
What I am doing is finding AB33(20) and then capturing everything up to the first whitespace. But AB33(20) can also be AB-33(20) or AB33(-20) or AB33(-1).
My preg_match regex is: (?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)
Why am I getting an error when I change \d{2} to \d+?
For the final result I thought this regex would work, but it does not:
(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)
Any ideas what I am doing wrong?
With most regex flavors, lookbehind needs to evaluate to a fixed-length sequence, so you can't use variable quantifiers like * or + or even {1,2}.
Instead of using lookaround, you can simply match your marker pattern and then forget it with \K.
AB-?\d+(?:\(-?\d+\))? \K[^ ]+
demo: https://regex101.com/r/8XXngH/1
It depends on the language. In .NET, for example, it matches, because .NET supports variable-length lookbehind.
Another solution might be to use a character class containing the characters you want to allow. Then match a whitespace character, reset the match start with \K, and match \S+, which matches one or more non-whitespace characters.
\bAB[()\d-]+\s\K\S+
Explanation
\bAB Match AB literally, preceded by a word boundary so that AB cannot be part of a larger word
[()\d-]+ Match 1+ times any of the characters listed in the character class
\s Match a whitespace char (or \s+ to match 1 or more)
\K Reset the starting point of the reported match (forget what was matched so far)
\S+ Match 1+ times a non-whitespace character
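For flavors without \K, a capture group gives the same result. The sketch below uses Python's re module as an assumed example of such a flavor (the question itself is about PHP's preg_match): it has no \K and rejects variable-length lookbehind. The helper name token_after_marker is hypothetical.

```python
import re

text = "Name Subname 11X22 88X620 AB33(20) YA5619 77,66"

# Match the marker, then capture the next run of non-whitespace in group 1.
pattern = re.compile(r"\bAB-?\d+\(-?\d+\)\s+(\S+)")

def token_after_marker(s):
    """Return the token following the AB..(..) marker, or None if absent."""
    m = pattern.search(s)
    return m.group(1) if m else None

print(token_after_marker(text))  # YA5619

# The variable-length lookbehind from the question fails to compile here.
try:
    re.compile(r"(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
```

The marker is still matched, but only group 1 is read, which plays the same role as \K in the patterns above.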

Simple Regex: match everything until the last dot

Just want to match every character up to but not including the last period
dog.jpg -> dog
abc123.jpg.jpg -> abc123.jpg
I have tried
(.+?)\.[^\.]+$
Use lookahead to assert the last dot character:
.*(?=\.)
This will do the trick
(.*)\.
The first captured group contains the name. You can access it as $1 or \1, depending on your language.
Regular expression quantifiers are greedy by default. This means that when a regex pattern is capable of matching more characters, it will match more characters.
This is a good thing, in your case. All you need to do is match characters and then a dot:
.*\.
That is,
. # Match "any" character
* # Do the previous thing (.) zero OR MORE times (any number of times)
\ # Escape the next character - treat it as a plain old character
. # Escaped, just means "a dot".
So: being greedy by default, match any character AS MANY TIMES AS YOU CAN (because greedy) and then a literal dot.
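The greedy behaviour described in these answers can be checked on the two file names from the question. This is a small sketch in Python, chosen only for illustration; the function names stem and stem_lookahead are hypothetical, and returning the input unchanged when there is no dot is an assumption.

```python
import re

def stem(name):
    """Capture-group version: greedy .* runs to the end, then backtracks to the last dot."""
    m = re.match(r"(.*)\.", name)
    return m.group(1) if m else name

def stem_lookahead(name):
    """Lookahead version: the dot is asserted but not consumed."""
    m = re.match(r".*(?=\.)", name)
    return m.group() if m else name

print(stem("dog.jpg"))                   # dog
print(stem("abc123.jpg.jpg"))            # abc123.jpg
print(stem_lookahead("abc123.jpg.jpg"))  # abc123.jpg

# Without a group or lookahead, the reported match includes the final dot.
print(re.match(r".*\.", "dog.jpg").group())  # dog.
```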

Regex Matching Behaviour Of \w

I noticed some interesting behaviour with some regex work I am doing, and I'd like some insight.
From what I understand, the word character, \w should match the following [a-zA-Z_0-9]
Given this input,
0000000060399301+0000000042456971+0000000
What should this regex
(\d+)\w
Capture?
I would expect it to capture 0000000060399301 but it actually captures 000000006039930
Is there something I am missing? Why is the 1 dropped from the end?
I noticed if I changed the regex to
(\d+\w)
It captures correctly i.e. including the 1
Anyone care to explain? Thanks
You require the regex to match a trailing word character - that would be the 1.
It cannot be another character, because
+ is not a word class character
+ is not a digit
matching is greedy
\d+ - matches one or more digit characters.
\w+ - matches one or more word characters. [A-Za-z\d_]
So with the string 0000000060399301+, the \d+ in (\d+)\w first matches all the digits, including the 1 before the +. The pattern still requires a \w after it, and + is not a word character, so the regex engine backtracks one character to the left and lets \w match the last digit. The capture group now contains 000000006039930, and the final 1 is matched by \w.
The 1 is being dropped because \w isn't in the capture group.
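The backtracking step can be observed by comparing the whole match with the captured group. This is a minimal sketch in Python (an assumption; the question does not say which engine was used), with the input taken from the question.

```python
import re

text = "0000000060399301+0000000042456971+0000000"
first_number = text.split("+")[0]  # the digits before the first +

outside = re.search(r"(\d+)\w", text)  # \w sits outside the group
inside = re.search(r"(\d+\w)", text)   # \w sits inside the group

print(outside.group(0))  # whole match: all the digits before the first +
print(outside.group(1))  # captured group: the same digits minus the last one
print(inside.group(1))   # captured group: all the digits before the first +
```

In both cases the overall match is the same; only the position of the closing parenthesis decides whether the digit taken by \w ends up in the group.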

How would I detect superscript for one word if there's no parentheses, but if there are parentheses, for all the contents of them?

I want to detect the two following circumstances, preferably with one regex:
This is a sentence ^that I wrote today.
And:
This is a sentence ^(that I wrote) today.
So basically, if there are parentheses after the caret, I want to match whatever is inside them. Otherwise, I just want to match just the next word.
I'm new to regex. Is this possible without making it too complicated?
\^(\w+|\([\w ]+\))
Options: case insensitive; ^ and $ match at line breaks
Match the character “^” literally «\^»
Match the regular expression below and capture its match into backreference number 1 «(\w+|\([\w ]+\))»
Match either the regular expression below (attempting the next alternative only if this one fails) «\w+»
Match a single character that is a “word character” (letters, digits, etc.) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «\([\w ]+\)»
Match the character “(” literally «\(»
Match a single character present in the list below «[\w ]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A word character (letters, digits, etc.) «\w»
The character “ ” « »
Match the character “)” literally «\)»
Created with RegexBuddy
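A quick check of this pattern against the two sentences from the question, sketched in Python (the question does not name a language, so this is only an illustration; the helper name superscript_text is hypothetical):

```python
import re

superscript = re.compile(r"\^(\w+|\([\w ]+\))")

def superscript_text(sentence):
    """Return what follows the caret: one word, or a parenthesised group."""
    m = superscript.search(sentence)
    return m.group(1) if m else None

print(superscript_text("This is a sentence ^that I wrote today."))    # that
print(superscript_text("This is a sentence ^(that I wrote) today."))  # (that I wrote)
```

Note that in the second case the captured text includes the parentheses themselves; they can be removed afterwards, for example with .strip("()"), if only the contents are wanted.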

Decode the regexp string that matches the word in string

I have the following regexp
var value = "hello";
@"(?<start>.*?\W*?)(?<term>" + Regex.Escape(value) + @")(?<end>\W.*?)"
I'm trying to figure out what it means, because it doesn't work against a single word.
For example, it matches "they said hello us", but fails for just "hello".
Can you please help me decode what this regexp string means?
PS: it's .NET regexp
It's because of the \W in the last part. \W matches any character other than a-z, A-Z, 0-9 and _.
In "they said hello us" there is a space after hello, but in "hello" there is nothing after it; that's why it fails.
If you change it to (?<end>\W*.*?) it may work.
Actually, the regex itself does not make sense to me; it should rather be something like
@"\b" + Regex.Escape(value) + @"\b"
\b is a word boundary.
The regex may be trying to find a pattern comprising whole words, so that your hello example doesn't match, say, Othello. If so, the word boundary regex, \b, is tailor-made for the purpose:
@"\b(" + Regex.Escape(value) + @")\b"
If this is a .NET regex and the Regex.Escape() part is replaced with just 'hello', RegexBuddy says it means:
(?<start>.*?\W*?)(?<term>hello)(?<end>\W.*?)
Options: case insensitive
Match the regular expression below and capture its match into backreference with name “start” «(?<start>.*?\W*?)»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match a single character that is a “non-word character” «\W*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference with name “term” «(?<term>hello)»
Match the characters “hello” literally «hello»
Match the regular expression below and capture its match into backreference with name “end” «(?<end>\W.*?)»
Match a single character that is a “non-word character” «\W»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
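The behaviour discussed in this thread can be reproduced outside .NET. The sketch below ports the patterns to Python purely as an illustration: Python spells named groups (?P<name>...) and uses re.escape in place of Regex.Escape, so it is an approximation of the .NET code, not the original.

```python
import re

value = "hello"

# Port of the original pattern; the mandatory \W after the term is the culprit.
original = re.compile(
    r"(?P<start>.*?\W*?)(?P<term>" + re.escape(value) + r")(?P<end>\W.*?)"
)

# Word-boundary version suggested in the answers.
bounded = re.compile(r"\b(" + re.escape(value) + r")\b")

print(bool(original.search("they said hello us")))  # True
print(bool(original.search("hello")))               # False: nothing for \W to match
print(bool(bounded.search("hello")))                # True
print(bool(bounded.search("Othello")))              # False: no boundary before "hello"
```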