I am trying to parse malformed HTML so that I can fix it with a Perl regex.
The malformed HTML is the following: <p>foo<p>bar</p>foo</p>
I would like the regex to return: <p>foo<p>
I tried something like: '|(<p\b[^>]*>(?!</p>)*?<p[^>]*>)|'
with no success, because I cannot repeat (?!</p>)*?
Is there a way in a Perl regex to say "any characters except the following sequence" (in my case </p>)?
Try something like:
<p>(?:(?!</?p>).)*</p>(?!(?:(?!</?p>).)*(<p>|$))
A quick break down:
<p>(?:(?!</?p>).)*</p>
matches a <p> ... </p> that contains neither <p> nor </p>. And the part:
(?!(?:(?!</?p>).)*(<p>|$))
succeeds when, looking ahead ((?! ... )), the text cannot reach a <p> or the end of the input ((<p>|$)) without first passing a <p> or </p> ((?:(?!</?p>).)*). In other words, the next <p> or </p> tag after the match must be a </p>.
A demo:
my $txt="<p>aaa aa a</p> <p>foo <p>bar</p> foo</p> <p> bb <p>x</p> bb</p>";
while($txt =~ m/(<p>(?:(?!<\/?p>).)*<\/p>)(?!(?:(?!<\/?p>).)*(<p>|$))/g) {
print "Found: $1\n";
}
prints:
Found: <p>bar</p>
Found: <p>x</p>
Note that this regex trickery only works for <p>baz</p> in the string:
<p>foo <p>bar</p> <p>baz</p> foo</p>
<p>bar</p> is not matched! After replacing <p>baz</p>, you could do a 2nd run on the input, in which case <p>bar</p> will be matched.
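For readers working outside Perl, here is a rough Python port of the demo above. It is only a sketch: the name innermost_paragraphs is made up for illustration, and it assumes bare <p> tags without attributes, as in the answer.

```python
import re

# Same pattern as the Perl demo: a <p>...</p> with no <p> or </p> inside,
# where the next p tag after it is not another <p> (and input has not ended).
INNER_P = re.compile(r"(<p>(?:(?!</?p>).)*</p>)(?!(?:(?!</?p>).)*(<p>|$))")

def innermost_paragraphs(text):
    """Return the <p>...</p> blocks the pattern flags in a single pass."""
    return [m.group(1) for m in INNER_P.finditer(text)]

txt = "<p>aaa aa a</p> <p>foo <p>bar</p> foo</p> <p> bb <p>x</p> bb</p>"
for found in innermost_paragraphs(txt):
    print("Found:", found)
```

Run on "<p>foo <p>bar</p> <p>baz</p> foo</p>", a single pass reports only <p>baz</p>, which is the caveat described above.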
I concur with Andy. Parsing nontrivial HTML with regexps is a world of pain.
Have a good look at HTML::TreeBuilder::XPath and HTML::DOM for making structural changes to HTML documents.
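To illustrate the parser route without the Perl modules named above, here is a minimal sketch that swaps in Python's standard html.parser instead. The class name NestedPFinder and its attributes are hypothetical, and it only detects a <p> opened inside another open <p>; it does not repair anything.

```python
from html.parser import HTMLParser

class NestedPFinder(HTMLParser):
    """Record every <p> start tag seen while another <p> is still open."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of currently open <p> elements
        self.nested_at = []   # (line, column) of each nested <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.depth > 0:
                self.nested_at.append(self.getpos())
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

def count_nested_p(html):
    finder = NestedPFinder()
    finder.feed(html)
    finder.close()
    return len(finder.nested_at)

print(count_nested_p("<p>foo<p>bar</p>foo</p>"))
```

Because the parser tokenizes tags itself, attributes such as <p class="x"> need no special handling here, unlike in the regex versions.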
This regexp:
<p>(?:(?!</p>).)*?<p>
when matched with
<p>foo<p>bar</p>foo</p>
results in
<p>foo<p>
If you're trying to validate HTML then consider a module like HTML::Tidy or HTML::Lint.
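The lazy pattern from this answer can be tried in Python as well; the sketch below assumes attribute-free <p> tags, and the constant name UNCLOSED_P is invented for the example.

```python
import re

# Lazy "tempered dot": consume characters one at a time, never stepping
# over a </p>, and stop at the first following <p>.
UNCLOSED_P = re.compile(r"<p>(?:(?!</p>).)*?<p>")

match = UNCLOSED_P.search("<p>foo<p>bar</p>foo</p>")
print(match.group(0) if match else "no match")
```

On well-formed input such as "<p>foo</p><p>bar</p>" the pattern finds nothing, because the lookahead refuses to cross the first </p>.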
Perhaps Marpa::HTML would help you. Read some interesting abilities it has on the author's blog about it. The short of it is that the parser works with the interpreter (I probably am getting some of the semantics incorrect) to figure out what should be present based on what CAN be present at a certain logical place in the code.
The examples shown therein fix similar problems as you seem to be dealing with in a much more consistent way than employing regexes which will inevitably suffer from edge cases.
Marpa::HTML comes with a command-line utility, built using the module, called html_fmt. This implements a parsing engine to fix and pretty-print html. Here is an example. If 'bad.html' contains <p>foo<p>bar</p>foo</p> then html_fmt bad.html gives:
<!-- Following start tag is replacement for a missing one -->
<html>
<!-- Following start tag is replacement for a missing one -->
<head>
</head>
<!-- Preceding end tag is replacement for a missing one -->
<!-- Following start tag is replacement for a missing one -->
<body>
<p>
foo
</p>
<!-- Preceding end tag is replacement for a missing one -->
<p>
bar
</p>
foo
<!-- Next line is cruft -->
</p>
</body>
<!-- Preceding end tag is replacement for a missing one -->
</html>
<!-- Preceding end tag is replacement for a missing one -->
Related
I'm using Sublime Text, and I want to use Find/Replace to make HTML to Markdown. One problem I encountered is how to replace multiple matches?
The HTML is below:
<blockquote>
<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>
<p> text 4 </p>
</blockquote>
And I want to change it to
><p> text 1 </p>
><p> text 2 </p>
><p> text 3 </p>
><p> text 4 </p>
I use
<blockquote>\n(^.+$\n)+?.+</blockquote>
to capture the p tag within the blockquote. But how to replace it?
Thanks a lot.
I have tested this for your simple test case. The main problem is, it may or may not work for more complex input, where you may need to further customize the regex.
Find what:
(?:<blockquote>\s*+|(?<!\A)(?<!</blockquote>)\G)(.*)\s++(?:</blockquote>)?
This solution also removes the closing tag when it matches the last line. It fixes the caveat in the first solution, where the end tag </blockquote> was not removed.
Replace with:
\n> $1
Use regular expression mode and highlight matches to check what will be replaced.
It will strip all leading spaces, and leave only 1 space between > and the text.
The regex above is built based on my own answer to the question of solving this class of problem with regex alone: Collapse and Capture a Repeating Pattern in a Single Regex Expression.
My earlier solution is based on the second construct, while the current solution is based on the first construct. The initial solution is quoted here, in case you want to customize the regex to be more flexible with its end tag (e.g. free spacing):
(?:<blockquote>\s*+|(?!\A)\G\s++(?!</blockquote>))(.*)
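Python's re module has no \G anchor, so the patterns above cannot be ported directly. A different technique that gives the same result is a replacement callback, sketched below; the function names are hypothetical and the sketch assumes blockquotes are not nested.

```python
import re

def blockquote_to_markdown(html):
    """Turn each <blockquote> block into Markdown-style '> ' lines."""

    def quote_lines(match):
        # Strip surrounding whitespace from each inner line, skip blank
        # lines, and prefix what is left with "> ".
        lines = (line.strip() for line in match.group(1).splitlines())
        return "\n".join("> " + line for line in lines if line)

    return re.sub(r"<blockquote>(.*?)</blockquote>", quote_lines, html, flags=re.DOTALL)

html = "<blockquote>\n  <p> text 1 </p>\n  <p> text 2 </p>\n</blockquote>"
print(blockquote_to_markdown(html))
```

As with the editor regex, exactly one space is left between > and the text.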
You can do this in two steps.
1)<blockquote>((?:(?!<\/blockquote>).)*)<\/blockquote> replace by $1.
See demo.
http://regex101.com/r/dZ1vT6/35
2)^\s+ replace by >
See demo.
http://regex101.com/r/dZ1vT6/36
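The two steps can be sketched in Python as follows, using > as the replacement in step 2 since that is what the desired output calls for. The sample assumes the <p> lines are indented; with unindented lines only the first one would receive a >, because ^\s+ needs whitespace to match.

```python
import re

html = (
    "<blockquote>\n"
    "  <p> text 1 </p>\n"
    "  <p> text 2 </p>\n"
    "  <p> text 3 </p>\n"
    "  <p> text 4 </p>\n"
    "</blockquote>"
)

# Step 1: drop the <blockquote> wrapper, keeping what is inside it.
inner = re.sub(
    r"<blockquote>((?:(?!</blockquote>).)*)</blockquote>",
    r"\1",
    html,
    flags=re.DOTALL,
)

# Step 2: replace the whitespace at the start of each line with ">".
quoted = re.sub(r"^\s+", ">", inner, flags=re.MULTILINE)
print(quoted)
```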
I'm trying to accomplish the same thing as seen here:
i.e. assuming you have a text like:
<p>something</p>
<!-- OPTIONAL -->
<p class="sdf"> some text</p>
<p> some other text</p>
<!-- OPTIONAL END -->
<p>The end</p>
What is the regex that would match:
<p class="sdf"> some text</p>
<p> some other text</p>
I've setup a live test here using:
<!-- OPTIONAL -->(.*?)<!-- OPTIONAL END -->
but it's not matching correctly. Also the accepted answer on the page didn't work for me. What am I missing?
Well unfortunately, RegExr is dependent on the JS RegExp implementation, which does not support the option to enable the flag/modifier that you need.
You are looking for the s (DotAll) modifier forcing the dot . to match newline sequences.
Live Demo on regular expressions 101
If you are using JavaScript, you can use this workaround:
/<!-- OPTIONAL -->([\S\s]*?)<!-- OPTIONAL END -->/
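The effect of the DotAll flag is easy to see in a language that supports it. The Python sketch below runs the same pattern three ways on the sample text from the question; the variable names are illustrative only.

```python
import re

text = """<p>something</p>
<!-- OPTIONAL -->
<p class="sdf"> some text</p>
<p> some other text</p>
<!-- OPTIONAL END -->
<p>The end</p>"""

pattern = r"<!-- OPTIONAL -->(.*?)<!-- OPTIONAL END -->"

# Without DotAll the dot stops at the first newline, so nothing matches.
no_flag = re.search(pattern, text)

# With DotAll the dot also matches newlines.
with_dotall = re.search(pattern, text, flags=re.DOTALL)

# The [\s\S] workaround needs no flag at all.
portable = re.search(r"<!-- OPTIONAL -->([\s\S]*?)<!-- OPTIONAL END -->", text)

print(no_flag)
print(with_dotall.group(1).strip())
```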
I have a plugin tag [crayon ...] that may or may not be rendered in a <p></p> block like so:
<p>This is a <b>sentence</b> [crayon ...] The Crayon [/crayon] of words. </p>
Since my tag is replaced by a <div> tag, the <p> is left disjoint from </p> and the browser closes it for me, leaving a blank paragraph above my plugin. In any case, the markup is invalid and has weird outcomes. My problem is that I need to detect whether [crayon lies within a <p></p> block. I have found two ways so far:
Use <p(?:\s+[^>]*)?>(.*?)</p(?:\s+[^>]*)?> and search for [crayon in the capture.
Use <p[^>]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*\[crayon for the case of <p>...[crayon where ... doesn't contain a </p> or <p> and a similar method for a </p> after the [crayon] tag.
The second method is harder to read but will fail if a </p> is captured before my tag. It doesn't require any further processing to find my tag within the <p></p> like the first. However, the first regex is much simpler and will execute quicker. Which should I use, and is there a better way?
EDIT:
For method 2, this beast works:
<p[^<]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*((?:\[crayon[^\]]*\].*?\[/crayon\])|(?:\[crayon[^\]]*/\]))(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*</p[^<]*>
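For comparison, the first method (capture each paragraph, then search the capture) might look like this in Python. This is a sketch under assumptions: the helper name and the sample string are invented, and like the regex itself it will also pick up <p> tags that sit inside HTML comments.

```python
import re

# Method 1: capture each paragraph body, then look for the tag inside it.
PARAGRAPH = re.compile(r"<p(?:\s+[^>]*)?>(.*?)</p(?:\s+[^>]*)?>", re.DOTALL)

def paragraphs_with_crayon(html):
    """Return the bodies of <p> blocks that contain a [crayon ...] tag."""
    return [m.group(1) for m in PARAGRAPH.finditer(html) if "[crayon" in m.group(1)]

sample = (
    "<p>intro</p>"
    '<p>This is a <b>sentence</b> [crayon lang="x"]code[/crayon] of words.</p>'
    "<pizza>[crayon]x[/crayon]</pizza>"
)
print(paragraphs_with_crayon(sample))
```

The <pizza> element is ignored because the opening pattern requires either > or whitespace directly after <p.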
Edit with improved regex, notice I also stole your open p tag detection ;). On PHP, had to add the s modifier for multi line match:
/(?<!<!--)<p[^<]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*\[crayon.*?\].*?\[\/crayon\].*?<\/p>(?!(\s)?-->)/s
The following string was used for testing. 5 matches expected, 179 steps taken (the single regex from question took 285 steps):
<p>This is a <b>sentence</b> [crayon]...[/crayon] of words.</p>
<p class="large"> Paragraph with parameters [crayon]...[/crayon]</p>
<p>[crayon with-parameters=true]...[/crayon]</p>
<p>
Multiline paragraph [crayon]...[/crayon].
Lorem ipsum.
</p>
<p>...</p><p>[crayon]...[/crayon]</p>
<!-- <p> --> This is a <b>sentence</b> [crayon]...[/crayon] of words.<!-- </p> -->
<pizza>yummy</pizza>
Any improvement?
I have such string
<p>test <span class=\"match\">match</span> <span class=\"testtes\">dddddd</span></p>
I want to get string without tags. But I want to save highlighting by class "match":
test <span class=\"match\">match</span> dddddd
If I want to just remove all tags I substitute all substrings that satisfied regexp /<\/?[^>]*>/ by empty string. But what regexp should I use in my special case?
UPD: The algorithm is: if you see <span class=\"match\">, then some text without tags, and then </span>, you shouldn't remove these spans; otherwise you should remove all tags.
You could do something like this
<\/?(?![^>]*class=\\"match)[^>]*>
This would preserve the opening tag and result in this
test <span class=\"match\">match dddddd
See it here on Regexr
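The same behaviour can be reproduced in Python. The sketch below uses plain double quotes instead of the escaped \" from the question, so the pattern is adjusted accordingly; that change is an assumption about the real input.

```python
import re

# Remove every tag except one whose attributes contain class="match.
STRIP_TAGS = re.compile(r'</?(?![^>]*class="match)[^>]*>')

html = '<p>test <span class="match">match</span> <span class="testtes">dddddd</span></p>'
stripped = STRIP_TAGS.sub("", html)
print(stripped)
```

The opening span survives, but both closing </span> tags are removed, which leaves the markup unbalanced.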
But how should I find the matching closing tag?
<p>test <span class=\"match\">match</span> <span class=\"testtes\">dddddd</span></p>
^^^^^^^ or the next one? ^^^^^^^
A regex can't know which closing tag belongs to the opening <span> tag that contains that class, and it has no means of pairing opening and closing tags. So it's not a good idea to do this with a regex.
I am quite sure the language you are using has an html parser that can be used to do this task.
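As one possible illustration of the parser approach, here is a sketch using Python's standard html.parser. The class and function names are hypothetical, and it assumes well-nested <span> tags; character references in the text are decoded by the parser rather than preserved.

```python
from html.parser import HTMLParser

class KeepMatchSpans(HTMLParser):
    """Drop all tags except <span class="match"> and its own closing tag."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.span_stack = []  # one flag per open <span>: True if it was kept

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        keep = "match" in classes
        self.span_stack.append(keep)
        if keep:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Emit </span> only when it closes a span that was kept.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            self.out.append("</span>")

    def handle_data(self, data):
        self.out.append(data)

def strip_except_match(html):
    parser = KeepMatchSpans()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

html = '<p>test <span class="match">match</span> <span class="testtes">dddddd</span></p>'
print(strip_except_match(html))
```

The stack is what the regex lacks: each </span> is paired with the most recently opened <span>, so the parser knows whether that closing tag should be kept.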
I'm a total regexp noob. I'm working with WordPress and I'm desperately trying to deal with WordPress's wpautop, which I hate and love (more hate!). Anyway, I'm trying to remove <p> tags around certain commands.
Here's what I get:
<p>
[hide]
<img.../>
[/hide]
</p>
or
<p>
[imagelist]
<img .../>
<img .../>
[/imagelist]
</p>
Here's what I'd like:
[hide]
<img.../>
[/hide]
or
[imagelist]
<img .../>
<img .../>
[/imagelist]
I've tried:
preg_replace('/<p[^>]*>(\[[^>]*\])<\/p[^>]*>/', '$1', $content); // No luck!
EDIT:
When I run the regexp, the content is still just a variable containing text. It is not parsed as HTML yet. I know it is possible because I already did it when getting rid of p tags around an image tag. So I just need a regexp to handle text that will be parsed as HTML at some point in the future.
Here's a similar question
Thanks!
Matt Mueller
You can't use regular expressions to parse HTML, because HTML is, by definition, a non-regular language. Period, end of discussion.
The language of matching HTML tags is context-free, not regular. This means regular expressions are probably not the right tool to use here. Context-free languages require parsers rather than regular expressions. So, you can either remove ALL <p> and </p> tags with a regular expression, or you can use an HTML parser to remove matching tags from certain parts of your document.
Try this regex:
'%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s'
Explanation:
\[([^\[\]]+)\] matches the opening bbcode tag and captures the tag name in group #2.
\[/\2\] matches the corresponding closing tag.
.*? matches anything, reluctantly. Thanks to the s flag at the end, it also matches newlines. The effect of the reluctant .*? is that it stops matching the first time it finds a closing bbcode tag with the right name. If tags are nested (within tags with the same name) or improperly balanced, it won't work correctly. I wouldn't expect that to be a problem, but I have no experience with WordPress, so YMMV.
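The regex is shown for PHP; the same pattern can be exercised in Python as a quick check. This is only a sketch: the helper name unwrap_shortcodes and the sample img tag are made up, and re.DOTALL stands in for the s flag.

```python
import re

# <p ...>, optional whitespace, [name] ... [/name], optional whitespace, </p>.
# Group 1 is the whole shortcode block; group 2 is the shortcode name.
P_AROUND_SHORTCODE = re.compile(
    r"<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>",
    re.DOTALL,
)

def unwrap_shortcodes(content):
    """Remove a <p>...</p> wrapper that contains only one shortcode block."""
    return P_AROUND_SHORTCODE.sub(r"\1", content)

content = '<p>\n[hide]\n<img src="a.png"/>\n[/hide]\n</p>'
print(unwrap_shortcodes(content))
```

Ordinary paragraphs are left alone because the pattern requires a [ right after the opening tag and any whitespace.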