RegEx: Don't match a certain character if it's inside quotes

Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.
Say I have this string:
some text <tag link="fo>o"> other text
I want to match the whole tag but if I use <[^>]+> it only matches <tag link="fo>.
How can I make sure that a > inside quotes is ignored?
I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.

Regular Expression:
<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>
Online demo:
http://regex101.com/r/yX5xS8
Full Explanation:
I know this regex might be a headache to look at, so here is my explanation:
< # Open HTML tags
[^>]*? # Lazy Negated character class for closing HTML tag
(?: # Open Outside Non-Capture group
(?: # Open Inside Non-Capture group
('|") # Capture group for quotes, backreference group 1
[^'"]*? # Lazy Negated character class for quotes
\1 # Backreference 1
) # Close Inside Non-Capture group
[^>]*? # Lazy Negated character class for closing HTML tag
)* # Close Outside Non-Capture group
> # Close HTML tags

This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the *? quantifier.
Regular expression
<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>
Demo
http://regex101.com/r/jO1oQ1
Explanation
< # start of HTML tag
[^'">]* # any non-single, non-double quote or greater than
( # outer group
( # inner group
"[^"]*" # "..."
| # or
'[^']*' # '...'
) #
[^'">]* # any non-single, non-double quote or greater than
)* # zero or more of outer group
> # end of HTML tag
This version is slightly better than Vasili's in that single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.
It is slightly worse than Vasili's solution in that the groups are capturing. If you do not want that, replace ( with (?: in all places. (Using plain ( makes the regex shorter and a little more readable.)
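To compare the two patterns above on the string from the question, here is a minimal Python sketch. The names BACKREF_TAG, SEPARATE_TAG and first_tag are made up for this illustration, and the second pattern is written in its non-capturing form.

```python
import re

# First answer's pattern (backreference to the opening quote), unchanged.
BACKREF_TAG = re.compile(r"""<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>""")
# Second answer's pattern with every "(" written as "(?:" (non-capturing).
SEPARATE_TAG = re.compile(r"""<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>""")

def first_tag(pattern, text):
    """Return the first substring matched by pattern, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None

sample = 'some text <tag link="fo>o"> other text'  # string from the question
print(first_tag(re.compile(r"<[^>]+>"), sample))  # <tag link="fo>
print(first_tag(BACKREF_TAG, sample))             # <tag link="fo>o">
print(first_tag(SEPARATE_TAG, sample))            # <tag link="fo>o">
print(first_tag(BACKREF_TAG, "<a href='>"))       # <a href='>
print(first_tag(SEPARATE_TAG, "<a href='>"))      # None
```

The last two lines show the difference described above: only the second pattern rejects a tag with an unterminated quote.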

(<.+?>[^<]+>)|(<.+?>)
You can write two regexes and then combine them with '|'.
In this case:
(<.+?>[^<]+>) # will match some text <tag link="fo>o"> other text
(<.+?>) # will match some text <tag link="foo"> other text
If the first alternative matches, the second one is not tried, so make sure you put the special case first.

If you want this to work with escaped double quotes, try:
/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g
For example:
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
const nextGtMatch = () => ((exec) => {
return exec ? exec.index : -1;
})(gtExp.exec(xml));
And if you're parsing through a bunch of XML, you'll want to set .lastIndex.
gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
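For readers who are not using JavaScript, the same lookahead idea can be sketched in Python. This is an illustration only: GT_OUTSIDE_QUOTES and next_gt_index are hypothetical names, the groups are made non-capturing, and the start argument plays the role of .lastIndex.

```python
import re

# A ">" only counts when an even number of unescaped double quotes
# remains between it and the end of the input.
GT_OUTSIDE_QUOTES = re.compile(
    r'>(?=(?:(?:[^"\\]|\\.)*"(?:[^"\\]|\\.)*")*(?:[^"\\]|\\.)*$)'
)

def next_gt_index(xml, start=0):
    """Index of the next '>' outside double quotes at or after start, else -1."""
    match = GT_OUTSIDE_QUOTES.search(xml, start)
    return match.start() if match else -1

print(next_gt_index('<tag link="fo>o"> other text'))  # 16
print(next_gt_index('<t a="x\\">y">'))                # 12
```

Because the lookahead scans to the end of the input for every candidate >, this approach can get slow on large documents.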

Related

Regex to catch strings that are not inside string pattern

I want to find a regex that catches all strings that are not inside a name('stringName') pattern.
For example I have this text:
fdlfksj "hello1" dsffsf "hello2\"hi" name("Tod").name('tod') 'hello3'
I want my regex to catch the strings:
"hello1", "hello2\"hi", 'hello3' (it should also catch "hello2\"hi", because an escaped \" must not end the string).
I also want my regex to ignore "Tod" because it's inside the pattern name("...").
How should I do it?
Here is my regex, which doesn't work:
((?<!(name\())("[^"]*"|'[^']*'))
It doesn't handle the escaped quotes \" and \',
and it also doesn't ignore name("Tod").
How can I fix it?
You can use the following regex:
(?<!name\()(["'])[^\)]+?(?<!\\)\1
It will match anything other than a closing parenthesis ([^\)]+?):
preceded by (["']) - a quote symbol
followed by (?<!\\)\1 - the same quote symbol, not preceded by a backslash
To avoid matching the values that come after name(, the negative lookbehind (?<!name\() checks that the opening quote does not directly follow name(.
Check the demo here.
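As a quick check, the lookbehind pattern above can be run against the text from the question in Python, whose re module accepts it because both lookbehinds are fixed-width. The names QUOTED, text and found are made up for this sketch.

```python
import re

# Pattern from the answer above, unchanged.
QUOTED = re.compile(r"""(?<!name\()(["'])[^\)]+?(?<!\\)\1""")

text = r"""fdlfksj "hello1" dsffsf "hello2\"hi" name("Tod").name('tod') 'hello3'"""
found = [m.group(0) for m in QUOTED.finditer(text)]
print(found)  # ['"hello1"', '"hello2\\"hi"', "'hello3'"]
```

Note that, as written, the pattern cannot match a quoted string that contains a ) character, since [^\)] excludes it.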
(["'])((?:\\\1)|[^\1]*?)\1
Regex Explanation
( Capturing group
["'] Match " (double) or ' (single) quote
) Close group
( Capturing group
(?: Non-capturing group
\\\1 Match \ followed by the quote by which it was started
) Close non-capturing group
| OR
[^\1]*? Non-greedy match anything except a quote by which it was started
) Close group
\1 Match the close quote
See the demo
You could get out of the way what you don't want, and use a capture group for what you want to keep.
The matches that you want are in capture group 2.
name\((['"]).*?\1\)|('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*")
Explanation
name\((['"]).*?\1\) Match name and then from the opening parenthesis till closing parenthesis between the same type of quote
| Or
( Capture group 2
'[^'\\]*(?:\\.[^'\\]*)*' Match the single-quoted value, including escaped characters
| Or
"[^"\\]*(?:\\.[^"\\]*)*" The same for double quotes
) Close group 2
Regex demo
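A minimal Python sketch of this "match what you don't want, capture what you do" approach, using the pattern above split over three lines for readability. PATTERN, text and found are hypothetical names chosen for the example.

```python
import re

PATTERN = re.compile(
    r"""name\((['"]).*?\1\)"""          # unwanted: name('...') or name("...")
    r"""|('[^'\\]*(?:\\.[^'\\]*)*'"""   # wanted (group 2): single-quoted string
    r"""|"[^"\\]*(?:\\.[^"\\]*)*")"""   # wanted (group 2): double-quoted string
)

text = r"""fdlfksj "hello1" dsffsf "hello2\"hi" name("Tod").name('tod') 'hello3'"""

# Matches of the first alternative leave group 2 unset, so they are skipped.
found = [m.group(2) for m in PATTERN.finditer(text) if m.group(2) is not None]
print(found)  # ['"hello1"', '"hello2\\"hi"', "'hello3'"]
```

The name(...) calls are still matched, which is what stops their quoted arguments from being picked up by the second alternative.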

Why isn't greedy matching working in a Perl regex group

I am trying to grab only what's BETWEEN the body tags in HTML with a Perl regex (so I don't want to include the actual body tags, which is why I'm using groups to throw the tags away).
Here are some short test subjects:
<body>test1</body>
<body style="bob">test2</body>
So first, simple version I tried was:
(?<=<body>).*(?=</body>)
which returns test1 and an empty string.
So then I tried:
(?<=<body).*(?=</body>)
which now gives a result for both tests, but of course includes garbage: ">test1" and " style="bob">test2"
I've tried every variation of greedy match in the first version, e.g.: (?<=<body.*>).*(?=</body>)
But it simply will not work! Any time I put the * in there I get errors. Can anybody help?
I am trying to grab only whats BETWEEN the body tags
In that case:
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <DATA>) {
if ($line =~ m{ <body [^>]*> (.+) </body> }xs) {
print "[$1]\n";
}
}
__DATA__
<body>test1</body>
<body style="bob">test2</body>
<!-- <body class="one"> --><body>This is why you should use an HTML parser</body>
Output:
[test1]
[test2]
[ --><body>This is why you should use an HTML parser]
You're looking for
while ($html =~ / <body[^>]*> ( (?: (?! </body\b ). )* ) /sxg) {
say $1;
}
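The same tempered-dot idea can be sketched in Python for anyone who wants to try it outside Perl. BODY, html and contents are names invented for this illustration; re.DOTALL stands in for Perl's /s flag.

```python
import re

# Capture everything after an opening <body ...> tag up to, but not
# including, the next "</body".
BODY = re.compile(r"<body[^>]*>((?:(?!</body\b).)*)", re.DOTALL)

html = '<body>test1</body>\n<body style="bob">test2</body>'
contents = [m.group(1) for m in BODY.finditer(html)]
print(contents)  # ['test1', 'test2']
```

Like the Perl versions above, this is still fooled by a body tag inside a comment or script.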
I don't think using $& is efficient. Personally, I'd use capture groups,
but this works pretty well.
/<(body)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>\K[\S\s]*?(?=<\/\1\s*>)/
https://regex101.com/r/EkPkLb/1
Expanded
<
( body ) # (1)
(?:
\s+
(?>
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
\K
[\S\s]*?
(?= </ \1 \s* > )
Note that to really find a particular tag, you have to consume all the
previous tags via a (*SKIP)(?!), else your tag could be embedded inside
script literals, comments or invisible content.
I wouldn't worry too much about it.
If you're interested I could post a fairly large proper regex,
but I doubt you'd be interested.
Choosing the best pattern for your data depends on what kind of characters will be contained in your body tags. An additional consideration is whether you want to aim for efficiency or minimal memory.
These are some suitable (or not) patterns for your case:
93steps ~<body[^>]*>\K.*(?=</body>)~ #no capture group,no newline matches
105steps ~<body[^>]*>\K[\S\s]*?(?=</body>)~ #no capture group, newline matches
87steps ~<body[^>]*>(.*)</body>~ #capture group, no newline matches
96steps ~<body[^>]*>([\S\s]*?)</body>~ #capture group, newline matches
Here is a Pattern Demo with three samples to show the impact of newline characters in your body text.

Exclude match results (pcre regex)

So I have this regex (from https://github.com/savetheinternet/Tinyboard/blob/master/inc/functions.php#L1620)
((?:https?:\/\/|ftp:\/\/|irc:\/\/)[^\s<>()"]+?(?:\([^\s<>()"]*?\)[^\s<>()"]*?)*)((?:\s|<|>|"|\.|\]|!|\?|,|,|")*(?:[\s<>()"]|$))
It works for matching links like http://stackoverflow.com/ etc.
The question is, how can I exclude these kinds of markup matches (mainly the url and img parts):
[url]http://stackoverflow.com/[/url]
[url=http://stackoverflow.com/]http://stackoverflow.com/[/url]
[img]http://cdn.sstatic.net/stackoverflow/img/sprites.png[/img]
[img=http://cdn.sstatic.net/stackoverflow/img/sprites.png]
To exclude these, you can add this subpattern at the beginning of your expression:
(?:\[(url|img)](?>[^[]++|\[(?!\/\g{-1}))*+\[\/\g{-1}]|\[(?:url|img)=[^]]*+])(*SKIP)(*FAIL)|your pattern here
The goal is to match the parts you don't want first and then force the regex engine to fail with the backtracking control verb (*FAIL). The (*SKIP) verb prevents the regex engine from retrying inside the substring that was already matched when the subpattern then fails.
You can find more information about these features here.
Notice: assuming that you are using PHP for this pattern, you can improve this very long pattern a little by replacing the default delimiter / with ~, which avoids escaping every / in the pattern, and by using verbose mode (the x modifier) with Nowdoc syntax. That way you can comment the pattern, make it more readable, and improve it more easily.
Example:
$pattern = <<<'EOF'
~
### skipping url and img bbcodes ###
(?:
\[(url|img)] # opening bbcode tag
(?>[^[]++|\[(?!/\g{-1}))*+ # possible content between tags
\[/\g{-1}] # closing bbcode tag
|
\[(?:url|img)= [^]]*+ ] # self closing bbcode tags
)(*SKIP)(*FAIL) # forces to fail and skip
| # OR
### a link ###
(
(?:https?|ftp|irc):// # protocol
[^\s<>()"]+?
(?:
\( [^\s<>()"]*? \) # part between parenthesis
[^\s<>()"]*?
)*
)
(
(?:[]\s<>".!?,]|,|")*
(?:[\s<>()"]|$)
)
~x
EOF;
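Python's re module has no (*SKIP)(*FAIL), but a similar effect can be had by matching the unwanted BBCode first and capturing only bare links. The sketch below swaps in that capture-group technique and uses a deliberately simplified link pattern, not the Tinyboard one; PATTERN and bare_links are hypothetical names.

```python
import re

PATTERN = re.compile(
    r"""
      \[(url|img)\] .*? \[/\1\]                   # paired bbcode: consume only
    | \[(?:url|img)=[^\]]*\]                      # bbcode with "=": consume only
    | ((?:https?|ftp|irc)://[^\s<>()"\[\]]+)      # bare link: capture (group 2)
    """,
    re.VERBOSE | re.DOTALL,
)

def bare_links(text):
    """Return links that are not part of [url]/[img] markup."""
    return [m.group(2) for m in PATTERN.finditer(text) if m.group(2)]

sample = ('see http://a.example/x and [url]http://b.example/[/url] '
          'or [img=http://c.example/i.png]')
print(bare_links(sample))  # ['http://a.example/x']
```

The first two alternatives play the role of the skipped subpattern: they consume the markup so the link alternative never gets a chance to start inside it.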
You could solve it with negative look-behind assertion.
(?<!pattern)
In your case, you can check if there is no ] or = character just before the matching link.
The regex below makes sure of exactly that:
(?<!(?:\=|\]))((?:https?:\/\/|ftp:\/\/|irc:\/\/)[^\s<>()"]+?(?:\([^\s<>()"]*?\)[^\s<>()"]*?)*)((?:\s|<|>|"|\.|\]|!|\?|,|,|")*(?:[\s<>()"]|$))
Note that the only part added is (?<!(?:\=|\])) at the very beginning, and that it will not match a link in something like <a href=http://example.com>. Your question does not specify this case, so improve the question if matching there is the expected behaviour, or work it out yourself using negative lookbehind.
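The lookbehind idea can be tried in isolation with a much shorter link pattern. This Python sketch is an assumption-laden simplification (LINK, text and links are invented names, and the link part is not the full Tinyboard regex); only the (?<![=\]]) prefix mirrors the answer.

```python
import re

# A link may not be directly preceded by "=" or "]".
LINK = re.compile(r'(?<![=\]])(?:https?|ftp|irc)://[^\s<>()"\[\]]+')

text = ('go to http://a.example/x now '
        '[url]http://b.example/[/url] [img=http://c.example/i.png]')
links = LINK.findall(text)
print(links)                                       # ['http://a.example/x']
print(LINK.findall('<a href=http://example.com>'))  # []
```

The second call shows the side effect mentioned above: an unquoted href value is rejected as well.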

Extract contents between two tags

I have a simple HTML string. From that string I would like to extract the contents BETWEEN two HTML tags.
My source string is this:
"Hello <b>world</b> test"
I would like to extract: "world"
How do I do that?
Assuming you don't mean any tag, but a specific tag (in this case <b>), and assuming that your HTML is well-formed and thus doesn't contain nested <b> tags:
(?s)<b[^>]*>((?:(?!</b>).)*)</b>
The result will be in group number 1.
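For instance, in Python the pattern can be tried like this. B_TAG and inner_text are names made up for the sketch, and re.DOTALL is passed as a flag in place of the inline (?s).

```python
import re

B_TAG = re.compile(r"<b[^>]*>((?:(?!</b>).)*)</b>", re.DOTALL)

def inner_text(html):
    """Return the contents of the first <b>...</b> pair, or None."""
    match = B_TAG.search(html)
    return match.group(1) if match else None

print(inner_text("Hello <b>world</b> test"))  # world
print(repr(inner_text("<b>two\nlines</b>")))  # 'two\nlines'
```

One caveat of the pattern as written: <b[^>]*> also accepts other tags whose names start with b, so <b\b[^>]*> may be a safer opening if such tags can occur.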
Explanation:
(?s) # Allow the dot to match newlines (hope you're not using JavaScript)
<b[^>]*> # Match opening <b> tag
( # Capture the following:
(?: # Match (and don't capture)...
(?! # (as long as we're not at the start of
</b> # the string </b>
) # )
. # any character.
)* # Repeat any number of times
) # End of capturing group.
</b> # Match closing </b> tag
While this might be possible in extremely simple contexts, I'd strongly recommend against it. Regexp is not powerful enough to parse HTML. Use a proper HTML parsing library.
I don't know what language you're using; this is a VB.NET example.
The pattern would be "hello (.+) test",
and the Regex.Matches function would take your input and pattern and return a collection of matches. Each match contains groups: group 0 is the whole match, "hello world test", and group 1 is the text inside the parentheses, "world".
System.Text.RegularExpressions.Regex.Matches("hello world test", "hello (.+) test").Item(0).Groups(1)
And like Dervall said, regex might not be powerful enough for what you want to do, and you might need heavy modification of the pattern to work with HTML, such as making white space (spaces, tabs, and new lines) optional.
I would use the following expression which will also validate that the end tag matches the beginning tag.
(?<=<(b)>)[^>]+(?=</\1>)
A more "digestible" example would be:
(?<=<(b)>)[^>]+(?=</b>)

How to extract regex comment

I have a regex like this
(?<!(\w/))$#Cannot end with a word and slash
I would like to extract the comment from the end. While the example does not reflect this case, there could be a regex which includes regex on hashes.
\##value must be a hash
What would the regex be to extract the comment, ensuring it is safe when used against a regex that could contain #'s that are not comments?
Here's a .Net flavored Regex for partly parsing .Net flavor patterns, which should get pretty close:
\A
(?>
\\. # Capture an escaped character
| # OR
\[\^? # a character class
(?:\\.|[^\]])* # which may also contain escaped characters
\]
| # OR
\(\?(?# inline comment!)\#
(?<Comment>[^)]*)
\)
| # OR
\#(?<Comment>.*$) # a common comment!
| # OR
[^\[\\#] # capture any regular character - not # or [
)*
\z
Luckily, in .Net each capturing group remembers all of its captures, and not just the last, so we can find all captures of the Comment group in a single parse. The regex more or less parses regular expressions, though far from fully: it parses just enough to find comments.
Here's how you use the result:
Match parsed = Regex.Match(pattern, pattern,
RegexOptions.IgnorePatternWhitespace |
RegexOptions.Multiline);
if (parsed.Success)
{
foreach (Capture capture in parsed.Groups["Comment"].Captures)
{
Console.WriteLine(capture.Value);
}
}
Working example: http://ideone.com/YP3yt
One last word of caution - this regex assumes the whole pattern is in IgnorePatternWhitespace mode. When it isn't set, all # are matched literally. Keep in mind the flag might change multiple times in a single pattern. In (?-x)#(?x)#comment, for example, regardless of IgnorePatternWhitespace, the first # is matched literally, (?x) turns the IgnorePatternWhitespace flag back on, and the second # is ignored.
If you want a robust solution you can use a regex-language parser.
You can probably adapt the .Net source code and extract a parser:
Reference Source - RegexParser.cs
GitHub - RegexParser.cs
Something like this should work (if you run it separately on each line of the regex). The comment itself (if it exists) will be in the third capturing group.
/^((\\.)|[^\\\#])*\#(.*)/
(\\.) matches an escaped character and [^\\\#] matches any character that is neither a backslash nor a hash; together with the * quantifier they match the entire line before the comment. The rest of the regex then detects the comment marker and extracts the text.
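A small Python sketch of this per-line approach, using the two patterns from the question as input. COMMENT and extract_comment are hypothetical names; the regex is the one above without the surrounding slashes.

```python
import re

# Escaped characters or non-backslash, non-hash characters, then the first
# unescaped "#", then the comment text in group 3.
COMMENT = re.compile(r"^((\\.)|[^\\#])*#(.*)")

def extract_comment(line):
    """Return the trailing comment of one regex line, or None if there is none."""
    match = COMMENT.match(line)
    return match.group(3) if match else None

print(extract_comment(r"(?<!(\w/))$#Cannot end with a word and slash"))
# Cannot end with a word and slash
print(extract_comment(r"\##value must be a hash"))  # value must be a hash
print(extract_comment(r"abc\#def"))                 # None
```

This does not account for an unescaped # inside a character class such as [#], which would be treated as the start of a comment.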
One of the overlooked options in regex parsing is the RightToLeft mode.
extract the comment from the end.
One can simplify the pattern by working from the end of the line to the beginning. For example:
^
.+? # Workable regex
(?<Comment> # Comment group
(?<!\\) # Not a comment if escaped.
\# # Anchor for actual comment
[^#]+ # The actual commented text to stop at #
)? # We may not have a comment
$
Use the above pattern in C# with these options RegexOptions.RightToLeft | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline
there could be a regex which includes regex on hashes
The line (?<!\\) # Not a comment if escaped. handles that situation: if there is a preceding \, we do not have a comment.