This question already has answers here:
Greedy vs. Reluctant vs. Possessive Qualifiers
(7 answers)
Closed 8 years ago.
I want to fetch a certain html node in a large html text, but something in my regex is bad.
I want to fetch all urls that look like this:
some stuff
I am trying to do:
/<a href="ftp:(.+)">/
Sometimes it works, but sometimes it grabs everything up to a later closing >.
Is there a way to rewrite this regex so it will stop at the first >?
+ is a greedy quantifier: it matches as much as it possibly can while still allowing the rest of the regex to match. For this, I recommend a negated character class, [^"]+, meaning any character except ", one or more times.
/<a href="ftp:([^"]+)">/
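As a rough illustration of the difference, here is a minimal Python sketch. The sample HTML and the variable names are made up for the demonstration; only the two patterns come from the question and the answer above.

```python
import re

# Hypothetical sample: two ftp links on the same line.
html = ('<a href="ftp://example.com/a.zip">A</a> '
        '<a href="ftp://example.com/b.zip">B</a>')

# Greedy: .+ runs to the end of the line, then backtracks to the LAST ">.
greedy = re.findall(r'<a href="ftp:(.+)">', html)

# Negated class: [^"]+ cannot cross a double quote, so each capture
# stops at its own closing quote.
negated = re.findall(r'<a href="ftp:([^"]+)">', html)

print(greedy)   # a single over-long capture spanning both tags
print(negated)  # one capture per link
```

With only one link per line the greedy version happens to work, which would explain why the original pattern fails only sometimes.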
Make your regex ungreedy:
/<a href="ftp:(.+?)">/
//        here __^
or:
/<a href="ftp:([^>"]+)">/
But it's better to use a parser.
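For what the parser route could look like, here is a minimal sketch using Python's standard html.parser module. The class name FtpLinkCollector, the helper ftp_links, and the sample markup are hypothetical; the question does not say which language or parser is available.

```python
from html.parser import HTMLParser


class FtpLinkCollector(HTMLParser):
    """Collects href values of <a> tags that start with "ftp:"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "href" and value and value.startswith("ftp:"):
                self.links.append(value)


def ftp_links(html):
    """Return all ftp: hrefs found in the given HTML text."""
    collector = FtpLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample: attribute order and quoting differ from the regex's
# assumptions, which is exactly where a parser helps.
sample = ("<a class='x' href='ftp://example.com/a.zip'>A</a>"
          '<a href="http://example.com/">B</a>')
print(ftp_links(sample))
```

The parser does not care about attribute order, quote style, or extra attributes, so it avoids the edge cases the regex versions have to work around.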
* and + are greedy (they match as much as possible). By appending ? after them, you can make them non-greedy.
/<a href="ftp:(.+?)">/
Or you can exclude " by using a negated character class ([^...]):
/<a href="ftp:([^"]+)">/
BTW, it's not a good idea to use regular expressions to parse HTML.
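The lazy and negated-class versions are not fully interchangeable. The sketch below, in Python, shows one case where they differ: a tag with an extra attribute after href. The sample tag and the third pattern (which drops the trailing >) are assumptions added for illustration and are not part of the answers above.

```python
import re

# Hypothetical tag with another attribute after href.
html = '<a href="ftp://example.com/a.zip" class="dl">A</a>'

# Lazy: expands until the first "> it can find, crossing the href's quote.
lazy = re.findall(r'<a href="ftp:(.+?)">', html)

# Negated class followed by ">: the quote is not followed by >, so no match.
strict = re.findall(r'<a href="ftp:([^"]+)">', html)

# Assumed variant: stop at the closing quote without requiring > after it.
loose = re.findall(r'<a href="ftp:([^"]+)"', html)

print(lazy)
print(strict)
print(loose)
```

The lazy pattern returns a polluted capture, while the negated class either matches cleanly or not at all, which is usually easier to detect and debug.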
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I have the following key=value pairs in a single-line string:
start=("a", "b") and between=("range(2019, max, to=\"le\")") and end=("a", "b")
Using regex in golang I want to extract the key=value pairs as below
start=("a", "b")
between=("range(2019, max, to=\"le\")")
end=("a", "b")
There are solutions on Stack Overflow, but they do not work with Go's regex flavor.
Here is a link to my failed attempt with the Go flavor: regex101 golang flavor
I would appreciate any help.
The problem is the escaped quotes:
\S+=(\([^(]*(?:[^("]*"(?:[^\\"]|\\["\\])*")(\)))
https://regex101.com/r/3ytO9P/1
I changed [^"] to (?:[^\\"]|\\["\\]). This makes the regex look for either a regular character or an escape. By matching the escape, it doesn’t allow \" to end the match.
Your regex has other problems though. This should work better:
\S+=(\([^("]*(?:[^("]*"(?:[^\\"]|\\["\\])*")*(\)))
https://regex101.com/r/OuDvyX/1
It changes [^(] to [^("] to prevent " from being matched unless it’s part of a complete string.
UPDATE:
@Wiktor Stribiżew commented below:
It still does not support other escape sequences. The first [^("]* is redundant in the current pattern. It won't match between=("a",,,) but will match between=("a",,",") - this is inconsistent. The right regex will match valid double quoted string literals separated with commas and any amount of whitespace between them. The \S+=(\([^(]*(?:[^("]*"(?:[^\\"]|\\["\\])*")(\))) is not the right pattern IMHO
If you really want the regex to be that robust, you should use a parser, but you could fix those problems by using:
\S+=(\((?:[^("]*"(?:[^\\"]|\\.)*"[^("]*)*(\)))
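As a quick check of that last pattern against the string from the question, here is a small sketch. It is written in Python rather than Go; the pattern uses no lookarounds or backreferences, so it should carry over to Go's regexp package, but that is an assumption that has not been verified here. The names PAIR, line, and pairs are made up for the example.

```python
import re

# Final pattern from the answer above, unchanged.
PAIR = re.compile(r'\S+=(\((?:[^("]*"(?:[^\\"]|\\.)*"[^("]*)*(\)))')

# Input string from the question; the raw string keeps \" as backslash + quote.
line = r'start=("a", "b") and between=("range(2019, max, to=\"le\")") and end=("a", "b")'

# group(0) is the whole key=value pair.
pairs = [m.group(0) for m in PAIR.finditer(line)]
for pair in pairs:
    print(pair)
```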
This question already has answers here:
How to match only strings that do not contain a dot (using regular expressions)
(3 answers)
Closed 3 years ago.
I have numerous SELECT statements joined by the UNION keyword in a single file. I want to extract only the db.table strings. How can I delete all words that do not contain a period (.) using regex in the Notepad++ editor? The database and table names are the only words with a period.
It's okay with me if newlines are not removed. As a learning bonus for everyone seeing this post, though, you could also show a regex that trims the newlines and produces this output:
db.table1
db.table2
...
db.tablen
You may try the following find and replace, in regex mode:
Find: (?<=^|\s)[^.]+(?=$|\s)
Replace: <empty string>
Note that my replacement only removes the undesired terms in the query; it does not make an effort to remove stray or leftover whitespace. To do that, you can easily do a quick second replacement to remove whitespace you don't want.
Edit:
It appears that Notepad++ doesn't like the variable-width lookbehinds I used in the pattern. Here is a refactored, more verbose version, which uses strictly fixed-width lookbehinds:
(^[^.]+$)|(^[^.]+(?=\s))|((?<=\s)[^.]+$)|((?<=\s)[^.]+(?=\s))
The logic in both of the above patterns is to match a run of non-dot characters that is bounded on each side by one of the following:
start of the string (^)
end of the string ($)
any type of whitespace (\s)
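To see the two-step approach (remove the non-dot terms, then clean up whitespace) outside Notepad++, here is a minimal Python sketch. Python's re module also requires fixed-width lookbehinds, so it uses the refactored pattern. The sample query and the names NON_DOT, sql, stripped, and tables are hypothetical.

```python
import re

# Fixed-width version from the edit above, unchanged.
NON_DOT = re.compile(
    r'(^[^.]+$)|(^[^.]+(?=\s))|((?<=\s)[^.]+$)|((?<=\s)[^.]+(?=\s))'
)

# Hypothetical one-line query for illustration.
sql = 'SELECT a FROM db.table1 UNION SELECT b FROM db.table2'

# Step 1: delete the terms without a dot; stray whitespace is left behind.
stripped = NON_DOT.sub('', sql)

# Step 2: the "second replacement", done here by splitting on whitespace.
tables = stripped.split()

print(repr(stripped))
print('\n'.join(tables))
```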
My guess is that replacing matches of this expression:
([\s\S]*?)(\S*(\.)\S*)
with $2\n, or replacing matches of:
(\S*(\.)\S*)|(.+?)
with $1, might work.
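A small Python sketch of the first replacement, for anyone who wants to try it outside the editor. Python writes the group reference as \2 where Notepad++ uses $2; the sample query and the variable names are assumptions for the demonstration.

```python
import re

# Hypothetical one-line query for illustration.
sql = 'SELECT a FROM db.table1 UNION SELECT b FROM db.table2'

# Group 1 lazily swallows everything before the next dotted word,
# group 2 is the dotted word itself; only group 2 is kept.
one_per_line = re.sub(r'([\s\S]*?)(\S*(\.)\S*)', r'\2\n', sql)

print(one_per_line)
```

Any text after the last dotted word is not matched and therefore stays in the output, so a trailing clause would still need a separate cleanup.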
This question already has answers here:
Regular expressions: Ensuring b doesn't come between a and c
(4 answers)
Closed 4 years ago.
The following regular expression is matching across [url] tags instead of stopping at each one...
Regular Expression (generic regular expression)
(?:\[url.*?\])(.*?youtu.*?)(?:\[\/url\])
String:
[url]blahyoutubeblah[/url] heyya [url]blahblah[/url] [url]www.youtube.com/blah[/url]
Help!!
Your captured group requires youtu inside, so the substring
[url]blahblah[/url] [url]www.youtube.com/blah[/url]
matches, because it starts with [url], includes youtu, and ends with [/url].
Simply using a negated character set, excluding [, probably isn't enough, because that wouldn't allow for nested tags to match, such as an input of
[url]foobar youtube[b]BOLD TEXT[/b][/url]
You might require negative lookahead for [/url] right before each repeated character:
(?:(?!\[\/url\]).)*
Also, make sure that whatever comes after the [url does not contain ]s before coming to the true ], with:
\[url[^]]*\]
In full:
\[url[^]]*\]((?:(?!\[\/url\]).)*youtu(?:(?!\[\/url\]).)*)\[\/url\]
There's no need to make the quantifiers lazy anymore, because of the negative lookahead.
Demo:
https://regex101.com/r/hSAJEp/1
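The same comparison can be reproduced with a short Python sketch. Both patterns and the input string are taken from the question and the answer above; only the variable names are made up.

```python
import re

# Input string from the question.
text = ('[url]blahyoutubeblah[/url] heyya [url]blahblah[/url] '
        '[url]www.youtube.com/blah[/url]')

# Pattern from the question: the lazy .*? can still run past a [/url].
original = r'(?:\[url.*?\])(.*?youtu.*?)(?:\[\/url\])'

# Pattern from this answer: each character is consumed only if [/url]
# does not start at that position.
tempered = r'\[url[^]]*\]((?:(?!\[\/url\]).)*youtu(?:(?!\[\/url\]).)*)\[\/url\]'

print(re.findall(original, text))
print(re.findall(tempered, text))
```

With the original pattern the second result swallows the blahblah tag and the start of the next one; with the lookahead version each result stays inside a single [url]...[/url] pair.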
Your pattern uses .*?, so after the opening [url] it matches anything, including other tags, up to youtu, and then continues to the next [/url].
A simple workaround is something like the following, which won't match an opening [ bracket before finding youtu:
(?:\[url.*?\])([^\[]*?youtu.*?)(?:\[\/url\])
The problem is that your regex requires youtu, but the second tag contains only blahblah, so the match runs on into the next tag. Making the pattern generic avoids that:
(?:\[url.*?\])(.*?)(?:\[\/url\])
The quantifier is lazy, but the regex will still match if it can: the engine does not move the left boundary of the match forward as long as a match is possible from the current position. There are several ways to deal with that. One of them is to rule out the unwanted match in the regex itself:
(?:\[url[^\]]*?\])([^\[]*?youtu.*?)(?:\[\/url\])
This question already has answers here:
Regex plus vs star difference? [duplicate]
(9 answers)
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 4 years ago.
I am trying to match the string "menu-item" only when it is followed by a number.
<li id="menu-item-578" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-578">
I can use this regex:
menu-item-[0-9]*
However, it matches all the menu-item strings. I want to match only "menu-item-578", but not the one in id="menu-item-578".
How can I do it?
Thank you.
You should avoid menu-item-[0-9]*, not because it matches the expected substring more than once, but because it also matches things you don't want, such as menu-item- in menu-item-one.
Besides replacing the quantifier with +, you have to check that the preceding character is not a non-whitespace character (that is, it is whitespace or the start of the string):
(?<!\S)menu-item-[0-9]+(?=["' ])
Or, if your regex flavor doesn't support lookarounds, you can use this, although it is less precise:
[ ]menu-item-[0-9]+
You may also constrain the following character with a stricter pattern:
[ ]menu-item-[0-9]+["' ]
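To compare the question's pattern with the lookaround version on the tag from the question, here is a minimal Python sketch. The variable names are made up; the patterns and the tag are taken from the thread.

```python
import re

# The <li> tag from the question.
tag = ('<li id="menu-item-578" class="menu-item menu-item-type-post_type '
       'menu-item-object-page menu-item-578">')

# Pattern from the question: * allows zero digits and nothing guards the left side.
naive = re.findall(r'menu-item-[0-9]*', tag)

# Lookaround version: must not be preceded by a non-whitespace character,
# must have at least one digit, and must be followed by a quote or a space.
guarded = re.findall(r'''(?<!\S)menu-item-[0-9]+(?=["' ])''', tag)

print(naive)
print(guarded)
```

The naive pattern also returns the bare menu-item- prefixes of menu-item-type-post_type and menu-item-object-page, plus the occurrence inside the id attribute; the guarded one returns only the class entry.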
Try this; it works too:
(\s)(menu-item-)\d+
https://regex101.com/
\s matches any whitespace character.
Use a space before, like this:
\ menu-item-[0-9]*
The first occurrence has a " right before it, while the second one has a space.
EDIT: use an online regex editor (like Regex tester) to try these things.
This question already has answers here:
Using alternation or character class for single character matching?
(3 answers)
Closed 4 years ago.
What's the difference between regex
[-+]?
and
(-|+)?
Don't they mean the same?
Both match the same characters, but the second form produces a capturing group. You can use a backreference to access the group (\1 or $1, depending on your regular expression engine).
UPDATE
The second form is invalid in many regular expression engines (it is valid only in some older engines that treat this + as a literal character), because + has a special meaning: one or more repetitions of the preceding pattern. Here there is nothing to repeat.
They match the same thing, but I would prefer the character class (first form), since the second form captures the - or +, which you may not need.
This is also equivalent, without capturing the text in a group (note that + must be escaped outside a character class):
(?:-|\+)?
Most regexes can be put in the form of alternation groups and the star operator - for example, [ab]+ can be written as (a|b)(a|b)* - but this is much more verbose, so the other operators exist. You included the question mark operator in your regex, but really [-+]? is equivalent to (\+|-|).
So there really is no difference (except for capturing as others have mentioned), but that doesn't mean the other operators aren't useful in making a regex compact and intuitive to understand.
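The points made in these answers can be checked with a short Python sketch. The \d+ suffix and the input '-42' are assumptions added so that there is something to match; the sign patterns themselves come from the question and answers, with + escaped where it appears outside a character class.

```python
import re

# The unescaped form from the question is rejected by Python's re module,
# because the + has nothing to repeat.
try:
    re.compile(r'(-|+)?')
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# All three accept the same inputs; they differ only in what they capture.
with_class = re.fullmatch(r'[-+]?\d+', '-42')
with_group = re.fullmatch(r'(-|\+)?\d+', '-42')
with_noncapture = re.fullmatch(r'(?:-|\+)?\d+', '-42')

print(unescaped_ok)
print(with_class.groups(), with_group.groups(), with_noncapture.groups())
```

Only the capturing form reports the sign as a group; the character class and the non-capturing group match the same text without creating one.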