Exclude match results (pcre regex) - regex

So I've this regex (from https://github.com/savetheinternet/Tinyboard/blob/master/inc/functions.php#L1620)
((?:https?:\/\/|ftp:\/\/|irc:\/\/)[^\s<>()"]+?(?:\([^\s<>()"]*?\)[^\s<>()"]*?)*)((?:\s|<|>|"|\.|\]|!|\?|,|,|")*(?:[\s<>()"]|$))
It works for matching links like http://stackoverflow.com/ and so on.
The question is: how can I exclude these kinds of markup matches (mainly the url and img parts)?
[url]http://stackoverflow.com/[/url]
[url=http://stackoverflow.com/]http://stackoverflow.com/[/url]
[img]http://cdn.sstatic.net/stackoverflow/img/sprites.png[/img]
[img=http://cdn.sstatic.net/stackoverflow/img/sprites.png]

To exclude these, you can add this subpattern at the beginning of your expression:
(?:\[(url|img)](?>[^[]++|\[(?!\/\g{-1}))*+\[\/\g{-1}]|\[(?:url|img)=[^]]*+])(*SKIP)(*FAIL)|your pattern here
The idea is to match the parts you don't want first, and then force the regex engine to fail with the backtracking control verb (*FAIL). The (*SKIP) verb tells the engine not to retry any position inside the substring already matched when the subpattern fails afterwards.
You can find more information about these features here.
Note: assuming that you are using PHP for this pattern, you can improve this very long pattern a little by replacing the default delimiter / with ~, which avoids escaping every / in the pattern, and by using verbose mode (the x modifier) with Nowdoc syntax. That way you can comment the pattern, make it more readable, and improve it more easily.
Example:
$pattern = <<<'EOF'
~
### skipping url and img bbcodes ###
(?:
    \[(url|img)]                  # opening bbcode tag
    (?>[^[]++|\[(?!/\g{-1}))*+    # possible content between tags
    \[/\g{-1}]                    # closing bbcode tag
  |
    \[(?:url|img)= [^]]*+ ]       # self closing bbcode tags
)(*SKIP)(*FAIL)                   # forces to fail and skip
|                                 # OR
### a link ###
(
    (?:https?|ftp|irc)://         # protocol
    [^\s<>()"]+?
    (?:
        \( [^\s<>()"]*? \)        # part between parenthesis
        [^\s<>()"]*?
    )*
)
(
    (?:[]\s<>".!?,]|,|")*
    (?:[\s<>()"]|$)
)
~x
EOF;
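Python's built-in re module has no (*SKIP)(*FAIL), so the sketch below is not the PCRE technique itself but a common substitute: the BBCode constructs come first in the alternation so that they are consumed, and only matches where the link group took part are kept. The link sub-pattern is simplified, the names bare_links and sample are made up for illustration, and the [url=...]...[/url] branch is an assumption based on the examples in the question.

```python
import re

# Alternation order matters: BBCode constructs are tried first and consumed,
# so a URL inside them is never reached by the link branch.
PATTERN = re.compile(r"""
    \[(url|img)\] .*? \[/\1\]          # [url]...[/url] or [img]...[/img]
  | \[url=[^\]]*\] .*? \[/url\]        # [url=...]...[/url] (assumed form)
  | \[img=[^\]]*\]                     # self-closing [img=...]
  | (?P<link>                          # a bare link (simplified)
        (?:https?|ftp|irc)://
        [^\s<>()"\[\]]+
    )
""", re.VERBOSE | re.DOTALL)


def bare_links(text):
    """Return links that are not part of [url]/[img] BBCode."""
    return [m.group("link") for m in PATTERN.finditer(text) if m.group("link")]


sample = (
    "see http://example.com/a and [url]http://example.com/b[/url] "
    "[url=http://example.com/c]http://example.com/d[/url] "
    "[img]http://example.com/e.png[/img] [img=http://example.com/f.png] "
    "ftp://example.com/g"
)
print(bare_links(sample))
```

Only the first and last URLs of the sample are reported; the four BBCode forms are matched by the earlier branches and dropped by the filter.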

You could solve it with a negative look-behind assertion:
(?<!pattern)
In your case, you can check if there is no ] or = character just before the matching link.
The regex below makes sure of exactly that:
(?<!(?:\=|\]))((?:https?:\/\/|ftp:\/\/|irc:\/\/)[^\s<>()"]+?(?:\([^\s<>()"]*?\)[^\s<>()"]*?)*)((?:\s|<|>|"|\.|\]|!|\?|,|,|")*(?:[\s<>()"]|$))
Note that the only part added is (?<!(?:\=|\])) right at the beginning. It will also refuse to match a link in something like <a href=http://example.com>, but your question does not specify this case, so improve the question if such links are expected to match, or work it out yourself using negative look-behind.
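A minimal Python sketch of the same look-behind idea, with a simplified link pattern (the sample string is invented for illustration). It also shows the caveat mentioned above: the href=... link is rejected along with the BBCode ones.

```python
import re

# A link is accepted only if the character just before it is neither "=" nor "]".
LINK = re.compile(r'(?<![=\]])(?:https?|ftp|irc)://[^\s<>()"\[\]]+')

sample = (
    "http://example.com/a [url]http://example.com/b[/url] "
    "[url=http://example.com/c]text[/url] "
    "<a href=http://example.com/d> ftp://example.com/e"
)
print(LINK.findall(sample))
```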

Related

select everything that does not match pattern

I'm trying to write a regex which gets everything but a specified pattern. I've been trying to use negative lookahead but whenever testing my expression, it never works.
I have files that are of this form:
(garbage info) filename (other garbage).extension
or
[garbage info] filename [other garbage].extension
For example, one of the files is [O2CXDR] report january [77012].pdf or
(XEW7CK) sales commissions (99723).xls
I'm using the regex.h library in C so I believe that it is a POSIX library.
I'm hoping to extract "filename" and ".extension" so that I can write a script which will rename the files to filename.extension.
So far, I have an expression that selects the garbage info with the brackets and the spaces around it, but I'm unable to select the rest.
\s*(\[|\().*?(\]|\))+\s*
and the negative lookahead I tried was:
.*(?!(\s*(\[|\().*?(\]|\))+\s*)).*
but it's just selecting everything in a single match.
I'm sure that I'm not understanding the lookaheads and lookbehind correctly. What do I have to do to fix my expression? Could somebody explain how they work since I'm a bit lost. Thanks!
Since you haven't specified a regex engine, I'll target a subset that can use the tags \K, \G, and \A (like PCRE).
The following uses a combination of match resets (\K), tempered greedy token, and start of match (without start of string) \G(?!\A), further explained below:
See regex in use here
Note: remove empty matches
\s*[[(].*?[])]\s*\K|\G(?!\A)(?:(?!\s*[[(].*?[])]\s*).)+
Match one of the following:
Option 1:
\s* Match any whitespace any number of times
[[(] Match either [ or (
.*? Match any character any number of times, but as few as possible (lazy matching)
[])] Match either ] or )
\s* Match any whitespace any number of times
\K Reset match - sets the given position in the regex as the new start of the match. This means that nothing preceding this tag will be captured in the overall match.
Option 2:
\G(?!\A) Match only at the starting point of the search or position of the previous successful match end, but not at the start of the string.
(?:(?!\s*[[(].*?[])]\s*).)+ Tempered greedy token: match one or more characters, each one only if the negative lookahead pattern (which is the same as the first option) does not match at that position.
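Python's built-in re module supports neither \K nor \G, so the pattern above cannot be used there as written. A rough substitute, sketched below, reuses only the "garbage block" sub-pattern from Option 1 and deletes every occurrence; the name strip_garbage is made up for this example.

```python
import re

# Optional spaces, an opening bracket, as little as possible,
# a closing bracket, optional spaces.
GARBAGE = re.compile(r"\s*[\[(].*?[\])]\s*")


def strip_garbage(name):
    """Remove every bracketed block together with the spaces around it."""
    return GARBAGE.sub("", name)


print(strip_garbage("[O2CXDR] report january [77012].pdf"))
print(strip_garbage("(XEW7CK) sales commissions (99723).xls"))
```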
$ cat input_file
(garbage info) filename (other garbage).extension
(garbage info)filename(other garbage).extension
(garbage info)file name(other garbage).extension
[garbage info] filename [other garbage].extension
[garbage info]filename[other garbage].extension
[garbage info]file name[other garbage].extension
$ sed -re 's/^\s*(\([^\)]*\)|\[[^]]*\])\s*(.*\S)\s*(\([^\)]*\)|\[[^]]*\])(\..*)$/\2\4/' input_file
filename.extension
filename.extension
file name.extension
filename.extension
filename.extension
file name.extension
Maybe with an expression as simple as
^(?:\(([^)]*)\)\s*([^(\r\n]*?)\s*\(([^)]*)\)|\[([^\]]*)\]\s*([^(\r\n]*?)\s*\[([^\]]*)\])\.(.*)$
we could extract those values.
Demo 1
If you don't need all of those capturing groups, we'd then simply remove those that we wouldn't want:
^(?:\([^)]*\)\s*([^(\r\n]*?)\s*\([^)]*\)|\[[^\]]*\]\s*([^(\r\n]*?)\s*\[[^\]]*\])\.(.*)$
Demo 2
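As a sketch of how the shorter expression could be used from code, here it is in Python (chosen only for illustration; the question itself targets C). The helper name split_name is hypothetical. Because the two alternatives capture the file name in different groups, the code picks whichever one took part in the match.

```python
import re

NAME = re.compile(
    r"^(?:\([^)]*\)\s*([^(\r\n]*?)\s*\([^)]*\)"
    r"|\[[^\]]*\]\s*([^(\r\n]*?)\s*\[[^\]]*\])\.(.*)$"
)


def split_name(name):
    """Return (stem, extension), or None if the name has another shape."""
    m = NAME.match(name)
    if m is None:
        return None
    # Group 1 is set for (...) names, group 2 for [...] names.
    stem = m.group(1) if m.group(1) is not None else m.group(2)
    return stem, m.group(3)


print(split_name("(XEW7CK) sales commissions (99723).xls"))
print(split_name("[O2CXDR] report january [77012].pdf"))
```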

Find strings not matching pattern in Notepad++ regex

I want to use Notepad++ regex to find all strings that do not match a pattern.
Sample Input Text:
{~Newline~}{~Indent,4~}{~Colour,Blue~}To be or not to be,{~Newline~}{~Indent,6~}
{~Colour,Green~}that {~StartItalic~}is{~EndItalic~} the question.{~EndDocument~}
The parts between {~ and ~} are markdown codes. Everything else is plaintext. I want to find all strings which do not have the structure of the markdown, and insert the code {~Plain~} in front of them. The result would look like this:
{~Newline~}{~Indent,4~}{~Colour,Blue~}{~Plain~}To be or not to be,{~Newline~}{~Indent,6~}{~Colour,Green~}{~Plain~}that {~StartItalic~}{~Plain~}is{~EndItalic~}{~Plain~} the question.{~EndDocument~}
The markdown syntax is open-ended, so I can't just use a list of possible codes to not process.
I could insert {~Plain~} after every ~}, then delete every {~Plain~} that's followed by {~, but that seems incredibly clunky.
I hope this works with the current version of Notepad++ (don't have it right now).
Matching with:
~}((?:[^{]|(?:{[^~]))+){~
and then replacing by
~}{~Plain~}$1{~
might work. The first group should capture everything between closing ~} and the next {~. It will also match { and } in the text, as long as they are not part of an opening tag {~.
EDIT Additional explanation, so you can modify it better:
~} end of previous tag
( start of the "interesting" group that contains text
(?: non-capturing group for +
[^{] everything except opening braces
| OR
(?:
{ opening brace followed by ...
[^~] ... some character which is not `~`
)
)+ end of non-capturing group for +, repeated 1 or more times
) end of the "interesting" group
{~ start of the next tag
Here is an interactive example: regex101 example
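To try the same search and replace outside Notepad++, here is a small Python sketch (braces are escaped and $1 becomes \1; the sample is the question's text joined onto one line). Note that, like the original, it only handles text that sits between two tags, not text before the first tag or after the last one.

```python
import re

# "~}" + text that never starts a "{~" tag + "{~"
BETWEEN_TAGS = re.compile(r"~\}((?:[^{]|\{[^~])+)\{~")

text = (
    "{~Newline~}{~Indent,4~}{~Colour,Blue~}To be or not to be,"
    "{~Newline~}{~Indent,6~}{~Colour,Green~}that "
    "{~StartItalic~}is{~EndItalic~} the question.{~EndDocument~}"
)

result = BETWEEN_TAGS.sub(r"~}{~Plain~}\1{~", text)
print(result)
```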
You need to use a negative lookahead. This regex matches every ~} that is not followed by {~ or by the end of the line, so you can just replace those matches with ~}{~Plain~}:
~}(?!{~|$)
If you don't want to match the space in {~Indent,6~} {~Colour,Green~}, just use this:
~}(?!{~|$| )
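A Python sketch of the lookahead variant, run on the two-line sample from the question. One assumption to flag: for $ to mean "end of line", as it does in Notepad++, Python needs re.MULTILINE; without it the ~} at the end of the first line would also get a {~Plain~}.

```python
import re

# "~}" not followed by the start of another tag or by the end of a line.
TAG_END = re.compile(r"~\}(?!\{~|$)", re.MULTILINE)

line1 = "{~Newline~}{~Indent,4~}{~Colour,Blue~}To be or not to be,{~Newline~}{~Indent,6~}"
line2 = "{~Colour,Green~}that {~StartItalic~}is{~EndItalic~} the question.{~EndDocument~}"

marked = TAG_END.sub("~}{~Plain~}", line1 + "\n" + line2)
print(marked)
```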

RegEx: Don't match a certain character if it's inside quotes

Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.
Say I have this string:
some text <tag link="fo>o"> other text
I want to match the whole tag but if I use <[^>]+> it only matches <tag link="fo>.
How can I make sure that > inside of quotes can be ignored.
I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.
Regular Expression:
<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>
Online demo:
http://regex101.com/r/yX5xS8
Full Explanation:
I know this regex might be a headache to look at, so here is my explanation:
< # Open HTML tags
[^>]*? # Lazy Negated character class for closing HTML tag
(?: # Open Outside Non-Capture group
(?: # Open Inside Non-Capture group
('|") # Capture group for quotes, backreference group 1
[^'"]*? # Lazy Negated character class for quotes
\1 # Backreference 1
) # Close Inside Non-Capture group
[^>]*? # Lazy Negated character class for closing HTML tag
)* # Close Outside Non-Capture group
> # Close HTML tags
This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the *? quantifier.
Regular expression
<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>
Demo
http://regex101.com/r/jO1oQ1
Explanation
< # start of HTML tag
[^'">]* # any non-single, non-double quote or greater than
( # outer group
( # inner group
"[^"]*" # "..."
| # or
'[^']*' # '...'
) #
[^'">]* # any non-single, non-double quote or greater than
)* # zero or more of outer group
> # end of HTML tag
This version is slightly better than Vasili's in that single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.
It is slightly worse than Vasili's solution in that the groups are captured. If you do not want that, replace ( with (?: in all places. (Just using ( makes the regex shorter and a little more readable.)
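A quick way to check these claims is to run the pattern, here in its non-capturing form, from Python; the test strings other than the question's are made up for illustration.

```python
import re

# Quoted values are consumed as a unit, so a ">" inside them cannot end the tag.
TAG = re.compile(r"""<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>""")

m = TAG.search('some text <tag link="fo>o"> other text')
print(m.group(0))

# Double quotes inside '...' are fine, and an unterminated quote gives no match.
print(TAG.search("""<p title='say "hi"'>""").group(0))
print(TAG.search("<a href='>"))
```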
(<.+?>[^<]+>)|(<.+?>)
You can write two regexes and then put them together using '|'.
In this case:
(<.+?>[^<]+>) #will match some text <tag link="fo>o"> other text
(<.+?>) #will match some text <tag link="foo"> other text
If the first alternative matches, the second one is not tried, so make sure you put the special case first.
If you want this to work with escaped double quotes, try:
/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g
For example:
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
const nextGtMatch = () => ((exec) => {
  return exec ? exec.index : -1;
})(gtExp.exec(xml));
And if you're parsing through a bunch of XML, you'll want to set .lastIndex.
gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
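For comparison, the same lookahead can be sketched in Python, where the start position is passed to search() instead of being set through lastIndex. The function name tag_end is hypothetical. The idea is that a > counts only if an even number of unescaped double quotes follows it up to the end of the input.

```python
import re

# ">" followed, up to the end of the input, by an even number of
# unescaped double quotes, i.e. a ">" that is not inside a quoted value.
GT_OUTSIDE_QUOTES = re.compile(
    r'>(?=(?:(?:[^"\\]|\\.)*"(?:[^"\\]|\\.)*")*(?:[^"\\]|\\.)*$)'
)


def tag_end(text, start=0):
    """Index of the first unquoted ">" at or after start, or -1 if none."""
    m = GT_OUTSIDE_QUOTES.search(text, start)
    return m.start() if m else -1


print(tag_end('some text <tag link="fo>o"> other text'))
```

Because the lookahead scans to the end of the input on every candidate, this can get slow on large documents; it also assumes the quotes in the whole input are balanced.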

How to extract regex comment

I have a regex like this
(?<!(\w/))$#Cannot end with a word and slash
I would like to extract the comment from the end. While the example does not reflect this case, there could be a regex with includes regex on hashes.
\##value must be a hash
What would the regex be to extract the comment, ensuring it is safe when used against a regex that could contain #'s which are not comments?
Here's a .Net flavored Regex for partly parsing .Net flavor patterns, which should get pretty close:
\A
(?>
\\. # Capture an escaped character
| # OR
\[\^? # a character class
(?:\\.|[^\]])* # which may also contain escaped characters
\]
| # OR
\(\?(?# inline comment!)\#
(?<Comment>[^)]*)
\)
| # OR
\#(?<Comment>.*$) # a common comment!
| # OR
[^\[\\#] # capture any regular character - not # or [
)*
\z
Luckily, in .Net each capturing group remembers all of its captures, and not just the last, so we can find all captures of the Comment group in a single parse. The regex pretty much parses a regular expression, though far from fully: it parses just enough to find comments.
Here's how you use the result:
Match parsed = Regex.Match(pattern, pattern,
                           RegexOptions.IgnorePatternWhitespace |
                           RegexOptions.Multiline);
if (parsed.Success)
{
    foreach (Capture capture in parsed.Groups["Comment"].Captures)
    {
        Console.WriteLine(capture.Value);
    }
}
Working example: http://ideone.com/YP3yt
One last word of caution - this regex assumes the whole pattern is in IgnorePatternWhitespace mode. When it isn't set, all # are matched literally. Keep in mind the flag might change multiple times in a single pattern. In (?-x)#(?x)#comment, for example, regardless of IgnorePatternWhitespace, the first # is matched literally, (?x) turns the IgnorePatternWhitespace flag back on, and the second # is ignored.
If you want a robust solution you can use a regex-language parser.
You can probably adapt the .Net source code and extract a parser:
Reference Source - RegexParser.cs
GitHub - RegexParser.cs
Something like this should work (if you run it separately on each line of the regex). The comment itself (if it exists) will be in the third capturing group.
/^((\\.)|[^\\\#])*\#(.*)/
(\\.) matches an escaped character, and [^\\\#] matches any character that is neither a backslash nor a hash; together with the * quantifier they match the entire line before the comment. The rest of the regex then detects the comment marker and extracts the text.
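A small Python sketch of this per-line approach, using a non-capturing prefix so that the comment ends up in group 1 rather than group 3; extract_comment is a name invented for the example. One limitation worth knowing: a # inside a character class, as in [#]x, would still be treated as the start of a comment.

```python
import re

# Escaped pairs and ordinary characters up to the first unescaped "#",
# then everything after it as the comment.
COMMENT = re.compile(r"^(?:\\.|[^\\#])*#(.*)")


def extract_comment(line):
    """Return the comment text of one pattern line, or None if there is none."""
    m = COMMENT.match(line)
    return m.group(1) if m else None


print(extract_comment(r"(?<!(\w/))$#Cannot end with a word and slash"))
print(extract_comment(r"\##value must be a hash"))
```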
One of the overlooked options in regex parsing is the RightToLeft mode.
"extract the comment from the end."
One can simplify the pattern by working from the end of the line to the beginning, such as:
^
.+? # Workable regex
(?<Comment> # Comment group
(?<!\\) # Not a comment if escaped.
\# # Anchor for actual comment
[^#]+ # The actual commented text to stop at #
)? # We may not have a comment
$
Use the above pattern in C# with these options RegexOptions.RightToLeft | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline
"there could be a regex with includes regex on hashes"
The line (?<!\\) # Not a comment if escaped. handles that situation by saying that if there is a preceding \, we do not have a comment.

Regular expression to cherry pick a multiline component of a paragraph sitting between tags (Not html)

In the following text I need a regex to capture the part between <tagstart> and </tagstart>.
Please note this is not html.
* real time results: shows results as you type
* code hinting: roll over your expression to see info on specific elements
* detailed results: roll over a match to see details & view group info below
* built in regex guide: doub<tagstart>le click entries to insert them into your expression
* online & desktop: regexr.com or download the desktop version for Mac, Windows, or Linux
* save your expressions: My Saved expr</tagstart>essions are saved locally
* search Community expressions and add your own
Thanks
EDIT: As @Kobi correctly points out in the comments, the much simpler version of the original post below is of course:
<(tagstart)>(.*?)</\1>
Since the original version also works and all the other statements remain true, I'll leave it as it is.
If (and only if) the tags cannot be nested:
<(tagstart)>((?:(?!</\1>).)*)</\1>
Explanation:
<(tagstart)> # matches "<tagstart>" and stores "tagstart" in group 1
( # begin group 2
(?: # begin non-capturing group
(?! # begin negative look-ahead (... not followed by)
</\1> # a closing tag with the same name as group 1
) # end negative look-ahead
. # if ok, match the next character
)* # end non-capturing group, repeat
) # end group 2 (stores everything between the tags)
</\1> # a closing tag with the same name as group 1
The regex needs to be applied in "single line" mode (sometimes called "dotall" mode). Either that, or you substitute [\s\S] for the . in the pattern.
To generically match text between any two equally named tags, use <(\w+)> instead of <(tagstart)>.
Depending on your regex flavor, some things may work differently, like $1 instead of \1 for back-references, or meta-characters that need additional escaping.
See a Rubular demo.
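As a sketch of the generic form in one concrete flavor, here is the pattern in Python, where dotall mode is the re.DOTALL flag and \1 works both in the look-ahead and in the closing tag. The text is a three-line excerpt of the question's sample.

```python
import re

text = """* built in regex guide: doub<tagstart>le click entries to insert them into your expression
* online & desktop: regexr.com or download the desktop version for Mac, Windows, or Linux
* save your expressions: My Saved expr</tagstart>essions are saved locally"""

PATTERN = r"<(\w+)>((?:(?!</\1>).)*)</\1>"

# With DOTALL the dot also matches newlines, so the match can span lines.
m = re.search(PATTERN, text, re.DOTALL)
print(m.group(1))
print(m.group(2))

# Without DOTALL the dot stops at the first newline and nothing is found.
print(re.search(PATTERN, text))
```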
Maybe this regexp would help: /(\<tagstart\>)(.+)(\<\/tagstart\>)/s. The second capture group is what you are searching for. See demo for details.
#!/usr/bin/perl -w
undef $/;
$_ = <>;
m|<(.*?)>(.*)</\1>|s;
print $2;
If you really need just <tagstart>, replace the bits like <(.*?)> with <tagstart>, and similarly for the closing tag. The undef $/ bit lets you slurp the whole input with a single read, and $2 selects the second match group. The s at the end of the regex asks for . to match even newline characters.