My regex expression is both lazy and greedy. Why? - regex

Suppose I'm searching for anchor links in a web page. A regex that works is:
"\<a\s+.*?\>"
However, lets add a complication. Lets suppose that I only want links which surround specific text, for instance, the word 'next'. Normally, I would think all I had to do is:
"\<a\s+.*?\>next"
But I find that now, if there are 3 anchor tags in a page, and the third one has 'next' after it, that the regex search finds a huge string extending from the first anchor tag, and extending to the third anchor tag. This makes sense if the period-asterisk-questionmark is finding all characters until it comes across ">next". But that is not what I want. I want to find all characters until it comes across ">", and then an additional constraint should be that right after the ">" there should be "next".
How do I get this to work?

You can fix your regex by prohibiting it from matching > inside the tag, i.e. by replacing . with [^>]:
"\<a\s+[^>]*?\>next"
.*? matches any number of characters. The fact that you made it reluctant does not make it stop at >: it continues matching past it, until it finds >next at the end. This is not greedy, because the expression matched as little as possible to obtain a match. It's just that no shorter matches were available.
Demo.

Related

Sigil editor: Regex string to look for a (hyphen) character in text, but not html attributes

My problem:
I use Sigil to edit xhtml files of an ebook.
When exporting from InDesign to ePub I tick option to remove forced line breaks.
This act removes all - hyphen characters which are auto-generated by InDesign, but the characters which were added manually during my word-break fine-tune remain in the text.
Current ability of Sigil search: searching by - parses everything, including css class names.
TODO: How to construct regex query which finds the - within the text, but not in the html code?
Thank you!
What I have already tried: https://www.mobileread.com/forums/showpost.php?p=4099971&postcount=169:
Here is a simple example to find the word "title" not inside a tag itself, here is the simplest regex search I could think of off the top of my head. It assumes there is no bare text in the body tag and that the xhtml is well formed.
I tried it and it appears to work. There are probably better more exhaustive regex, that can handle even broken xhtml.
Code:
title(?=[^>]*<)
This basically says search for "title" but lookahead to make sure there are no closing tag chars ">" before you find the next opening tag char "<".
There are probably look behind versions that could work with reverse logic. And there are ways to use regex to find a two strings that ignores any intervening tags.
Give it a try. You could add a saved search easily to do that. But again it will not handle find and replacement of text that crosses over elements (over nodes in the tree). That is the hard part unless you have one to one corresponding matching of matching substrings to replacement substrings which in general need not be the case.
And of course if you use < and > inside strings to show a "tag" or code snippet, these would be found by mistake so reviewing each find before the replace would be needed.
In Sigil, PCRE regex engine is used.
Thus, you can use
<[^<>]*>(*SKIP)(*F)|-
See the regex demo.
Details:
<[^<>]*>(*SKIP)(*F) - matches <, zero or more chars other than < and > and then a >, and then skips the match and goes on to search for the next match from the position where the failure occurred
| - or
- - a hyphen.
NOTE: you might want to match any dashes with [\p{Pd}\x{00AD}] (to replace with -).

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The cursive parts are variable, might contains spaces or a new line might start from this point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or break/new line as I am unable to predict what could be between the fixed words tha I am sure will always be there.
Now comes the hard part, when I would add the word "On" somewhere in the text above the sentence that I want to match, the regex now matches a much bigger text than I want. This is due to the use of [\s\S]+.
How am I able to make my regex match as less characters as possible? Using "?" before the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real life, with html tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. Although, in html, barely possible to occur as the user would never write the matching tag himself, I do want to make sure my regex is correctjust in case
These things are certain: I know with what the part of text starts, no matter where in respect to the entire text, I know with what the part of text ends, and there are specific fixed words that might make the regex more reliable, but they can be ommitted. Any text below the searched part is also allowed to be matched, but no text above may be matched at all
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches all character because of union of two complementary sets, it is like . with special option /s (dot matches newlines). and regex are greedy by default so the largest match will be returned.
Following correct link, the token just after the shortest match must be geschreven, so another way to write without using lazy expansion, which is more flexible is to prepend the repeated chracter set by a negative lookahead inside loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is for non capturing the group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead which ensures current position before next character is not the beginning of end token.
Following comments it can be explicitly mentioned what should not appear in repeating sequence :
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
to understand what the technic (?:(?!tokens)[\s\S])+ does.
in the first this can't appear between From and this
in the second From or this can't appear between From and this
in the third this or point can't appear between this and point
etc.

Using Flags of Regex within Google Forms

I'm trying to use flags within Google Forms, and I've been googling hoping to find an answer in the last couple of hours, but didn't find any. Google Forms say that the regular expression is not valid. Even when I use a simple regex such as: (?i)t. I'm trying to use the regex inside a paragraph question.
How can I make it work?
Edit:
What I really need is to match [a-zA-Z" ]+( *),( *)[1-9]([0-9]??)\n repeatedly, so each line will look something like: Sam "The Man" McAdams , 9\n. Of course, the number of lines is unknown. using the repetition modifiers of * or + at the end of the regex does not satisfy my needs, because if the first line is accepted as valid, the other lines might be composed of anything really, and it considers it as a valid input, while it's not.
You can use the following expression to validate an entire string that only consists of lines meeting your pattern:
^([a-zA-Z" ]+ *, *[1-9][0-9]?(\n|$))+$
See the regex demo.
The main point is to add an alternation group to match either a newline or the end of string ((\n|$)) and wrap the whole pattern into a +-quantified group ((...)+) anchored at both start (^) and end ($).

Notepad++ regex group capture

I have such txt file:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
Trying to delete all subdomains with such regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Receive:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why last line becomes com.ua instead of jwbefw.com.ua ?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. As some TLD (top level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLD in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why this psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua the regex engine will first fail to match a,
then it will move the token one place to the left to look at "ua"
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because ".ua" met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, .+ matching one more character and then evaluating the rest of the regex again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
But the implimentation of ? also creates issues.
Adding in the question mark makes that first .+ lazy, but then causes group1 to match bb.prontube.ru instead of prontube.ru
This is because that first period after the bb will match, then inside group 1 (.*?) will match bb.prontube. before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change that third group from (.*?) to ([\w-]*?) so it won't capture . only letters and numbers, or a dash.
resulting regex:
.+?\.(([\w-])*?\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.
last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
Look in the picture above, what regex capture referring
It's color the way you will not need a regex explaination anymore ....

Remove everything before and after variable=int

I'm terrible at regex and need to remove everything from a large portion of text except for a certain variable declaration that occurs numerous times, id like to remove everything except for instances of mc_gross=anyint.
Generally we'd need to use "negative lookarounds" to find everything but a specified string. But these are fairly inefficient (although that's probably of little concern to you in this instance), and lookaround is not supported by all regex engines (not sure about notepad++, and even then probably depends on the version you're using).
If you're interested in learning about that approach, refer to How to negate specific word in regex?
But regardless, since you are using notepad++, I'd recommend selecting your target, then inverting the selection.
This will select each instance, allowing for optional white space either side of the '=' sign.
mc_gross\s*=\s*\d+
The following answer over on super user explains how to use bookmarks in notepad++ to achieve the "inverse selection":
https://superuser.com/questions/290247/how-to-delete-all-line-except-lines-containing-a-word-i-need
Substitute the regex they're using over there, with the one above.
You could do a regular expression replace of ^.*\b(mc_gross\s*=\s*\d+)\b.*$ with \1. That will remove everything other than the wanted text on each line. Note that on lines where the wanted text occurs two or more times, only one occurrence will be retained. In the search the ^.*\b matches from start-of-line to a word boundary before the wanted text; the \b.*$ matches everything from a word boundary after the wanted text until end of line; the round brackets capture the wanted text for the replacement text. If text such as abcmc_gross=13def should be matched and retained as mc_gross=13 then delete the \bs from the search.
To remove unwanted lines do a regular expression search for ^mc_gross\s*=\s*\d+$ from the Mark tab, tick Bookmark line and click Mark all. Then use Menu => Search => Bookmark => Remove unmarked lines.
Find what: [\s\S]*?(mc_gross=\d+|\Z)
Replace with: \1
Position the cursor at the start of the text then Replace All.
Add word boundaries \b around mc_gross=\d+ if you think it's necessary.