I'm trying to accomplish the same thing as seen here:
i.e. assuming you have a text like:
<p>something</p>
<!-- OPTIONAL -->
<p class="sdf"> some text</p>
<p> some other text</p>
<!-- OPTIONAL END -->
<p>The end</p>
What is the regex that would match:
<p class="sdf"> some text</p>
<p> some other text</p>
I've set up a live test here using:
<!-- OPTIONAL -->(.*?)<!-- OPTIONAL END -->
but it's not matching correctly. Also the accepted answer on the page didn't work for me. What am I missing?
Unfortunately, RegExr depends on the JavaScript RegExp implementation, which, before ES2018, did not support the flag you need.
You are looking for the s (DotAll) modifier, which forces the dot . to match newline sequences as well.
Live Demo on regular expressions 101
If you are using JavaScript, you can use this workaround:
/<!-- OPTIONAL -->([\S\s]*?)<!-- OPTIONAL END -->/
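As a rough illustration of the difference the DotAll flag makes, here is a minimal sketch in Python (not the asker's tool; Python's `re.DOTALL` plays the role of the s modifier). The sample text is the one from the question.

```python
import re

text = """<p>something</p>
<!-- OPTIONAL -->
<p class="sdf"> some text</p>
<p> some other text</p>
<!-- OPTIONAL END -->
<p>The end</p>"""

pattern = r'<!-- OPTIONAL -->(.*?)<!-- OPTIONAL END -->'

# Without DotAll, "." stops at the first newline, so nothing matches.
no_flag = re.search(pattern, text)

# With DotAll, "." also matches newlines and the block is captured.
with_flag = re.search(pattern, text, re.DOTALL)

# The [\S\s] workaround needs no flag at all.
workaround = re.search(r'<!-- OPTIONAL -->([\S\s]*?)<!-- OPTIONAL END -->', text)

print(no_flag)
print(with_flag.group(1).strip())
print(workaround.group(1) == with_flag.group(1))
```

The captured group includes the newlines right after the opening comment and before the closing one, hence the `strip()` when printing.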
I would like to use regex to search for all instances of a footer in an epub like the following sample:
<p class="calibre1">2 <> GENERAL INTRODUCTION </p>
of the more general format:
<p class="calibre1">[page number from 1-1000][" <>"][Title of section]</p>
My goal is to use calibre's regex search to find all instances of that footer and delete them, but none of the expressions I've tried matches even the single example above:
<p class="calibre1">[0-9] <>[^>] </p>
<p class="calibre1">[0-9] <> [\w] </p>
and even the general:
<p class="calibre1">[\w--[\d_]]</p>
<p class="calibre1">[0-9] [.]</p>
<p class="calibre1">[0-9] *[.]</p>
<p class="calibre1">[0-9][*.]</p>
I'm new to regex and am pulling my hair out. Please help with my (mis)understanding.
This should work for what you want:
^<p[ \t]*class="calibre1">[0-9]+[^<]*<>[^<]*<[/]p>$
Please try this:
^<p class="calibre1">\d{1,4}.*</p>$
^ - Anchor to the start of the line
<p class="calibre1"> - Actual text to match
\d{1,4} - match 1 to 4 digits
.* - then zero or more characters
</p> - until the closing tag
$ - anchored to the end of the line
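To compare the two suggested patterns, here is a small sketch in Python (calibre's search is built on Python regular expressions, but this script is only an illustration outside calibre). The second and third sample lines are made up for the comparison, and `re.MULTILINE` is assumed so that ^ and $ anchor at each line.

```python
import re

# The two patterns from the answers above, unchanged.
STRICT = re.compile(r'^<p[ \t]*class="calibre1">[0-9]+[^<]*<>[^<]*<[/]p>$', re.MULTILINE)
LOOSE = re.compile(r'^<p class="calibre1">\d{1,4}.*</p>$', re.MULTILINE)

sample = "\n".join([
    '<p class="calibre1">2 <> GENERAL INTRODUCTION </p>',  # footer from the question
    '<p class="calibre1">1999 was a long year.</p>',       # hypothetical body text
    '<p class="calibre1">Some body text</p>',              # hypothetical body text
])

strict_hits = STRICT.findall(sample)
loose_hits = LOOSE.findall(sample)

print(strict_hits)  # only the footer line
print(loose_hits)   # the footer and the paragraph that starts with digits
```

The second pattern is shorter but also catches any paragraph that merely begins with digits, so the first one is probably the safer choice for a bulk delete.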
I have a html with:
<p class="s5">Chapter 1 – General Information</p>
<p class="s5">Section 1 – Example</p>
<p>Some text</p>
<p class="s5">Chapter 2 – Introduction</p>
and I want to replace every <p class="s5"> tag that starts with Chapter with <h1> ... </h1>.
How do I perform this with a regex substitution in Sublime Text?
You haven't indicated which language/tool you're using, so here's a generic solution:
Search: (?<=<p class="s5">)(Chapter[^<]*)
Replace: <h1>$1</h1>
Breakdown:
(?<=<p class="s5">) is a look behind (non-consuming assertion) for <p class="s5">
(Chapter[^<]*) is text starting with Chapter and everything up to the next <
If your tool doesn't understand lookbehinds, you can just consume and replace the preceding input instead:
Search: <p class="s5">(Chapter[^<]*)
Replace: <p class="s5"><h1>$1</h1>
Note that languages and tools vary in their back-reference syntax; the $1 may need to be \1 instead.
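A minimal Python sketch of the lookbehind substitution, using \1 as the back-reference. Note that it wraps the chapter title in <h1> but leaves the surrounding <p> in place; the second substitution is a variation that is not part of the answer above and swaps the whole element instead, assuming the paragraph contains no nested tags.

```python
import re

html = """<p class="s5">Chapter 1 – General Information</p>
<p class="s5">Section 1 – Example</p>
<p>Some text</p>
<p class="s5">Chapter 2 – Introduction</p>"""

# Lookbehind version from the answer: the <p ...> wrapper is kept.
kept_p = re.sub(r'(?<=<p class="s5">)(Chapter[^<]*)', r'<h1>\1</h1>', html)

# Variation (an assumption, not from the answer): consume the whole
# element so that <p class="s5">...</p> becomes <h1>...</h1>.
swapped = re.sub(r'<p class="s5">(Chapter[^<]*)</p>', r'<h1>\1</h1>', html)

print(kept_p)
print(swapped)
```

In both cases the "Section" paragraph and the plain paragraph are left untouched because the pattern requires the text to start with Chapter.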
I'm using Sublime Text, and I want to use Find/Replace to convert HTML to Markdown. One problem I encountered is how to replace multiple matches.
The HTML is below:
<blockquote>
<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>
<p> text 4 </p>
</blockquote>
And I want to change it to
><p> text 1 </p>
><p> text 2 </p>
><p> text 3 </p>
><p> text 4 </p>
I use
<blockquote>\n(^.+$\n)+?.+</blockquote>
to capture the p tag within the blockquote. But how to replace it?
Thanks a lot.
I have tested this for your simple test case. The main problem is, it may or may not work for more complex input, where you may need to further customize the regex.
Find what:
(?:<blockquote>\s*+|(?<!\A)(?<!</blockquote>)\G)(.*)\s++(?:</blockquote>)?
This solution will clean up the closing tag as it matches the last line. It fixes the caveat in the first solution where the end tag </blockquote> is not removed.
Replace with:
\n> $1
Use regular expression mode and highlight matches to check what will be replaced.
It will strip all leading spaces, and leave only 1 space between > and the text.
The regex above is built based on my own answer to the question of solving this class of problem with regex alone: Collapse and Capture a Repeating Pattern in a Single Regex Expression.
My earlier solution is based on the second construct, while the current solution is based on the first construct. The initial solution is quoted here, in case you want to customize the regex to be more flexible with its end tag (e.g. free spacing):
(?:<blockquote>\s*+|(?!\A)\G\s++(?!</blockquote>))(.*)
You can do this in two steps.
1) <blockquote>((?:(?!<\/blockquote>).)*)<\/blockquote> replace by $1.
See demo.
http://regex101.com/r/dZ1vT6/35
2) ^\s+ replace by >
See demo.
http://regex101.com/r/dZ1vT6/36
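The two steps can be sketched in Python as follows. Several things here are assumptions: the DOTALL and MULTILINE flags stand in for whatever flags the demos used, the replacement in step 2 is `>` because that is the marker the desired output shows, and the lines inside the blockquote are assumed to be indented (step 2 only touches lines that start with whitespace).

```python
import re

# Hypothetical input: same shape as the question, with indented lines.
html = (
    "<blockquote>\n"
    "    <p> text 1 </p>\n"
    "    <p> text 2 </p>\n"
    "</blockquote>"
)

# Step 1: drop the <blockquote> wrapper and keep its content.
step1 = re.sub(
    r'<blockquote>((?:(?!</blockquote>).)*)</blockquote>',
    r'\1',
    html,
    flags=re.DOTALL,
)

# Step 2: replace leading whitespace on each line with ">".
step2 = re.sub(r'^\s+', '>', step1, flags=re.MULTILINE)

print(step2)
```

Because step 2 runs over the whole text, any other indented line in the document would also get a `>` prefix, so in practice it is safer to apply it only to the selected region.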
I have a plugin tag [crayon ...] that may or may not be rendered in a <p></p> block like so:
<p>This is a <b>sentence</b> [crayon ...] The Crayon [/crayon] of words. </p>
Since my tag is replaced by a <div> tag, the <p> is left disjoint from </p> and the browser closes it for me, leaving a blank paragraph above my plugin. In any case, the markup is invalid and has weird outcomes. My problem is that I need to detect whether [crayon lies inside a <p></p> block. I have found two ways so far:
Use <p(?:\s+[^>]*)?>(.*?)</p(?:\s+[^>]*)?> and search for [crayon in the capture.
Use <p[^>]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*\[crayon for the case of <p>...[crayon where ... doesn't contain a </p> or <p> and a similar method for a </p> after the [crayon] tag.
The second method is harder to read, but it correctly fails to match if a </p> appears before my tag, and unlike the first it requires no further processing to find my tag within the <p></p>. However, the first regex is much simpler and will execute quicker. Which should I use, and is there a better way?
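For reference, the first method (match each <p>...</p> block, then look for [crayon in the capture) might look like this in Python. This is only a sketch: the plugin itself is presumably not written in Python, the sample input is made up from the example above, and `re.DOTALL` is assumed so that paragraphs may span lines.

```python
import re

P_BLOCK = re.compile(r'<p(?:\s+[^>]*)?>(.*?)</p(?:\s+[^>]*)?>', re.DOTALL)

# Hypothetical sample input.
html = (
    '<p>This is a <b>sentence</b> [crayon ...] The Crayon [/crayon] of words. </p>\n'
    '<p>No tag here</p>\n'
    '<pizza>yummy</pizza>\n'
)

# Keep only the paragraphs whose captured content contains the tag.
hits = [m.group(0) for m in P_BLOCK.finditer(html) if '[crayon' in m.group(1)]

print(hits)
```

The `(?:\s+[^>]*)?>` part is what keeps `<pizza>` from being mistaken for a paragraph: after `<p` the pattern accepts only whitespace plus attributes, or the closing `>`.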
EDIT:
For method 2, this beast works:
<p[^<]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*((?:\[crayon[^\]]*\].*?\[/crayon\])|(?:\[crayon[^\]]*/\]))(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*</p[^<]*>
Edited with an improved regex; notice I also borrowed your opening p tag detection ;). In PHP, I had to add the s modifier so the match can span multiple lines:
/(?<!<!--)<p[^<]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*\[crayon.*?\].*?\[\/crayon\].*?<\/p>(?!(\s)?-->)/s
The following string was used for testing. 5 matches expected, 179 steps taken (the single regex from question took 285 steps):
<p>This is a <b>sentence</b> [crayon]...[/crayon] of words.</p>
<p class="large"> Paragraph with parameters [crayon]...[/crayon]</p>
<p>[crayon with-parameters=true]...[/crayon]</p>
<p>
Multiline paragraph [crayon]...[/crayon].
Lorem ipsum.
</p>
<p>...</p><p>[crayon]...[/crayon]</p>
<!-- <p> --> This is a <b>sentence</b> [crayon]...[/crayon] of words.<!-- </p> -->
<pizza>yummy</pizza>
Any improvement?
I am trying to parse malformed HTML and fix it using a Perl regex.
The malformed HTML is the following: <p>foo<p>bar</p>foo</p>
I would like the Perl regex to return: <p>foo<p>
I tried something like: '|(<p\b[^>]*>(?!</p>)*?<p[^>]*>)|'
with no success because I cannot repeat (?!</p>)*?
Is there a way in a Perl regex to say "all characters except the following sequence" (in my case </p>)?
Try something like:
<p>(?:(?!</?p>).)*</p>(?!(?:(?!</?p>).)*(<p>|$))
A quick break down:
<p>(?:(?!</?p>).)*</p>
matches <p> ... </p> that does not contain either <p> or </p>. And the part:
(?!(?:(?!</?p>).)*(<p>|$))
is "true" when, looking ahead ((?! ... )), there is no <p> or end of input ((<p>|$)) that can be reached without passing a <p> or </p> in between ((?:(?!</?p>).)*).
A demo:
my $txt="<p>aaa aa a</p> <p>foo <p>bar</p> foo</p> <p> bb <p>x</p> bb</p>";
while($txt =~ m/(<p>(?:(?!<\/?p>).)*<\/p>)(?!(?:(?!<\/?p>).)*(<p>|$))/g) {
print "Found: $1\n";
}
prints:
Found: <p>bar</p>
Found: <p>x</p>
Note that this regex trickery only works for <p>baz</p> in the string:
<p>foo <p>bar</p> <p>baz</p> foo</p>
<p>bar</p> is not matched! After replacing <p>baz</p>, you could do a 2nd run on the input, in which case <p>bar</p> will be matched.
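The two-pass behaviour described in the note can be reproduced with a short Python sketch (a port of the same pattern for illustration; removing the match stands in for whatever replacement you actually apply).

```python
import re

# Same pattern as in the Perl demo above.
INNER_P = re.compile(r'(<p>(?:(?!</?p>).)*</p>)(?!(?:(?!</?p>).)*(<p>|$))')

txt = "<p>foo <p>bar</p> <p>baz</p> foo</p>"

# First pass: only <p>baz</p> is found.
first_pass = [m.group(1) for m in INNER_P.finditer(txt)]

# Remove what was found (the lookahead is zero-width, so only the
# matched element is removed), then run the same pattern again.
remaining = INNER_P.sub("", txt)
second_pass = [m.group(1) for m in INNER_P.finditer(remaining)]

print(first_pass)
print(remaining)
print(second_pass)
```

In general you would repeat the pass until no match is left, since each run only picks up the last nested paragraph before an enclosing </p>.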
I concur with Andy. Parsing nontrivial HTML with regexps is a world of pain.
Have a good look at HTML::TreeBuilder::XPath and HTML::DOM for making structural changes to HTML documents.
This regexp:
<p>(?:(?!</p>).)*?<p>
when matched with
<p>foo<p>bar</p>foo</p>
results in
<p>foo<p>
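The same pattern behaves identically in Python, shown here only as a quick way to check the result outside Perl. The `(?:(?!</p>).)*?` part consumes one character at a time as long as a </p> does not start at that position.

```python
import re

bad_html = "<p>foo<p>bar</p>foo</p>"

# An opening <p>, then any characters that do not begin a </p>,
# as few as possible, up to the next opening <p>.
match = re.search(r'<p>(?:(?!</p>).)*?<p>', bad_html)

print(match.group(0) if match else None)
```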
If you're trying to validate HTML then consider a module like HTML::Tidy or HTML::Lint.
Perhaps Marpa::HTML would help you. You can read about some of its interesting abilities on the author's blog. The short of it is that the parser works with the interpreter (I am probably getting some of the semantics wrong) to figure out what should be present based on what CAN be present at a certain logical place in the code.
The examples shown there fix problems similar to the one you seem to be dealing with, and do so much more consistently than regexes, which will inevitably suffer from edge cases.
Marpa::HTML comes with a command-line utility, built using the module, called html_fmt. This implements a parsing engine to fix and pretty-print html. Here is an example. If 'bad.html' contains <p>foo<p>bar</p>foo</p> then html_fmt bad.html gives:
<!-- Following start tag is replacement for a missing one -->
<html>
<!-- Following start tag is replacement for a missing one -->
<head>
</head>
<!-- Preceding end tag is replacement for a missing one -->
<!-- Following start tag is replacement for a missing one -->
<body>
<p>
foo
</p>
<!-- Preceding end tag is replacement for a missing one -->
<p>
bar
</p>
foo
<!-- Next line is cruft -->
</p>
</body>
<!-- Preceding end tag is replacement for a missing one -->
</html>
<!-- Preceding end tag is replacement for a missing one -->