Regex "replace until" - regex

This is a bit more complicated (to me!) than similar questions.
I'm trying to use regex to copy some dates in some old HTML to another location, but when I repeat the search and replace, the replacements extend outside the required block.
In the example below, each <ul> .. </ul> block represents a specific date, which appears in the first <li>..<b>. I want to copy "13 Oct 2005" into the subsequent <li>..<b>, and likewise "14 Oct 2005" into the subsequent <li>..<b>, but not outside its <ul> .. </ul> block.
I can do this using the following regex (in the syntax used by Funduc's Search & Replace utility for Windows, apparently "a subset of UNIX grep notation"):
Search: <li>+[0-9] *<b>*[]<li><b>
Replace: <li>%1 %2\<b>%3\<li>%1 %2\<b>
+[0-9] is one or more numeric; * is any alphabetical; *[] is anything; %1 .. %4 are replacement positions.
Here is my original HTML
<ul>
<li>13 Oct 2005<b>Title One</b>
Some text
<p>
<li><b>Title Two</b>
Some more text
<p>
</ul>
<ul>
<li>14 Oct 2005<b>Title 3</b>
Another line of text
<p>
<li><b>Title 4</b>
Yet another line of text
<p>
<li><b>Title 5</b>
Some text
<p>
</ul>
After my first script run, this correctly gives me:
<ul>
<li>13 Oct 2005<b>Title One</b>
Some text
<p>
<li>13 Oct 2005<b>Title Two</b>
Some more text
<p>
</ul>
<ul>
<li>14 Oct 2005<b>Title 3</b>
Another line of text
<p>
<li>14 Oct 2005<b>Title 4</b>
Yet another line of text
<p>
<li><b>Title 5</b>
Some text
<p>
</ul>
But after my second script run, the 13 Oct 2005 is added incorrectly to the next <ul> .. </ul> block:
<ul>
<li>13 Oct 2005<b>Title One</b>
Some text
<p>
<li>13 Oct 2005<b>Title Two</b>
Some more text
<p>
</ul>
<ul>
<li>14 Oct 2005<b>Title 3</b>
Another line of text
<p>
<li>14 Oct 2005<b>Title 4</b>
Yet another line of text
<p>
<li>13 Oct 2005<b>Title 5</b> <-- wrong !!!
Some text
<p>
</ul>
I have about 20,000 <ul> .. </ul> blocks (hence the script), and each block contains between 1 and 10 <li><b> tags with titles. I assumed it couldn't be done in one pass.

tl;dr Regular expressions are the wrong tool here; use something stronger.
Please, no
Parsing HTML with regex is a bad idea. It's even in the SO regex reference: Reference - What does this regex mean?
So don't do it.
Yes, I see you're still here
If you're completely in charge of your HTML files, and you don't want a general-purpose tool or anything, I can't stop you from working with regexes.
Still, no
As you've noticed, a single regular expression is the wrong tool here. Crucially, regular expressions, even powerful ones with backreferences and lookahead, don't have extra memory of what they're not looking at.
But that's exactly what you're asking for here! You want to know if you've left the <ul> block, while looking only at the dates and tags that delimit what you're searching for.
What you need is a programming language.
A style point
The regex syntax you use is nothing like unix-style regex syntax. That's forgivable.
What's unforgivable (to any regex enthusiast) is using multi-line regexes without explaining why it's necessary.
Aha! But wait! Now that we've concluded we need a programming language, multi-line regexes are no longer necessary!
So, let's stop using them.
Why are we still here?
At this point, I'm here for self-flagellation: I've concluded that, on one hand, we need a programming language, because we need something stronger than regexes; on the other hand, I won't use an HTML parser because I swore to myself that I would use only regexes.
I am clearly an idiot, and you shouldn't listen to anything I have to say.
Perverse solution
We can fix the file in one pass, once we allow ourselves to use a programming language. We just need to save a little state: the date in the current <ul> block.
Here's a Perl script which fits the terrible goal I set for myself (I'm using 5.22):
#! /usr/bin/perl
use strict;
use warnings;

# A date between <li> and <b>, e.g. "13 Oct 2005"
my $date_re = qr/^<li>(\d+ [[:alpha:]]+ \d+)<b>/;
# A title line with no date between <li> and <b>
my $non_date_title_re = qr/^<li>(<b>.*<\/b>)$/;
# The date of the current <ul> block, if any
my $local_date = '';

while (<>) {
    if (/<\/ul>/) {
        $local_date = '';
    } elsif (/$date_re/) {
        $local_date = $1;
    } elsif (/$non_date_title_re/) {
        s/$non_date_title_re/<li>$local_date$1/;
    }
    print;
}
You may not read Perl, but what's going on here is pretty clear: first, save some regexes in local variables, for clarity: one for dates between <li> and <b>, and one for titles with no dates.
For each line in the file, if it contains </ul>, invalidate the local date we've saved. (For the file in your question, this part is not strictly necessary.) If instead the line matches the date regex, save the date for later. If instead the line matches a non-date title, use substitution to put the date we saved right after the <li>.
But really, use an HTML parser.

Related

Using regex in Find/Replace to replace multiple matches

I'm using Sublime Text, and I want to use Find/Replace to make HTML to Markdown. One problem I encountered is how to replace multiple matches?
The HTML is below:
<blockquote>
<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>
<p> text 4 </p>
</blockquote>
And I want to change it to
><p> text 1 </p>
><p> text 2 </p>
><p> text 3 </p>
><p> text 4 </p>
I use
<blockquote>\n(^.+$\n)+?.+</blockquote>
to capture the p tag within the blockquote. But how to replace it?
Thanks a lot.
I have tested this for your simple test case. The main problem is, it may or may not work for more complex input, where you may need to further customize the regex.
Find what:
(?:<blockquote>\s*+|(?<!\A)(?<!</blockquote>)\G)(.*)\s++(?:</blockquote>)?
This solution will clean up the closing tag as it matches the last line. It fixes the caveat in the first solution, where the end tag </blockquote> is not removed.
Replace with:
\n> $1
Use regular expression mode and highlight matches to check what will be replaced.
It will strip all leading spaces, and leave only 1 space between > and the text.
The regex above is built based on my own answer to the question of solving this class of problem with regex alone: Collapse and Capture a Repeating Pattern in a Single Regex Expression.
My earlier solution is based on the second construct, while the current solution is based on the first construct. The initial solution is quoted here, in case you want to customize the regex to be more flexible with its end tag (e.g. free spacing):
(?:<blockquote>\s*+|(?!\A)\G\s++(?!</blockquote>))(.*)
You can do this in two steps.
1) Replace <blockquote>((?:(?!<\/blockquote>).)*)<\/blockquote> with $1.
See demo.
http://regex101.com/r/dZ1vT6/35
2) Replace ^\s+ with >
See demo.
http://regex101.com/r/dZ1vT6/36
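If a scripting language is available instead of an editor's Find/Replace, the two steps can be combined with a replacement callback. The following Python sketch is an assumption-laden illustration rather than a translation of either answer: Python's re module has no \G anchor, so it captures each blockquote body and prefixes the lines in a helper. The names quote_block and blockquotes_to_markdown are hypothetical, and nested blockquotes are not handled.

```python
import re

# Lazily capture everything between one <blockquote> and the next </blockquote>
BLOCKQUOTE_RE = re.compile(r"<blockquote>(.*?)</blockquote>", re.DOTALL)


def quote_block(match):
    """Prefix every non-empty line of the captured body with '> '."""
    lines = [line.strip() for line in match.group(1).splitlines()]
    return "\n".join("> " + line for line in lines if line)


def blockquotes_to_markdown(html):
    """Replace each <blockquote>...</blockquote> with Markdown quote lines."""
    return BLOCKQUOTE_RE.sub(quote_block, html)
```

Like the first answer, this leaves one space between > and the text and strips leading whitespace from each line.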

Omit certain sequence of characters using Regex

I need a regular expression to get the string between the <ul> and </ul> tags.
The problem is that if there is a nested <ul></ul> inside the <ul> tag, the regex stops at the inner closing tag, but I need the entire string between the two outer tags.
Can anyone help me?
Try this regex
String text = "<ul>My list</ul>";
String text1 = text.replaceAll("</?ul>", "");
The ? makes the preceding / optional (one time or none at all), so the pattern takes out both <ul> and </ul>.
This is Java, by the way. The regex may work in other languages too.
The following regex will pick up everything from the first <ul> to the last </ul>:
<ul>(.*)</ul>
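The difference between stopping at the inner tag and reaching the outer one comes down to lazy versus greedy matching. A small Python sketch (the function names are invented for this example) shows both behaviours side by side:

```python
import re


def inner_lazy(html):
    """Lazy .*? stops at the first </ul>, i.e. the inner closing tag."""
    match = re.search(r"<ul>(.*?)</ul>", html, re.DOTALL)
    return match.group(1) if match else None


def inner_greedy(html):
    """Greedy .* runs to the last </ul> in the input."""
    match = re.search(r"<ul>(.*)</ul>", html, re.DOTALL)
    return match.group(1) if match else None
```

Note the trade-off: the greedy version does not understand nesting, it simply runs to the last </ul> in the input. If the text contains two sibling lists, it will swallow everything between the first opening tag and the final closing tag.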

Regex: Detect text that exists between <p></p>

I have a plugin tag [crayon ...] that may or may not be rendered in a <p></p> block like so:
<p>This is a <b>sentence</b> [crayon ...] The Crayon [/crayon] of words. </p>
Since my tag is replaced by a <div> tag, the <p> is left disjoint from </p> and the browser closes it for me, leaving a blank paragraph above my plugin. In any case, the markup is invalid and has weird outcomes. My problem is that I need to detect if [crayon lies between a <p></p> block. I have found two ways so far:
Use <p(?:\s+[^>]*)?>(.*?)</p(?:\s+[^>]*)?> and search for [crayon in the capture.
Use <p[^>]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*\[crayon for the case of <p>...[crayon where ... doesn't contain a </p> or <p> and a similar method for a </p> after the [crayon] tag.
The second method is harder to read but will fail if a </p> is captured before my tag. It doesn't require any further processing to find my tag within the <p></p> like the first. However, the first regex is much simpler and will execute quicker. Which should I use, and is there a better way?
EDIT:
For method 2, this beast works:
<p[^<]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*((?:\[crayon[^\]]*\].*?\[/crayon\])|(?:\[crayon[^\]]*/\]))(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*</p[^<]*>
Edit with improved regex, notice I also stole your open p tag detection ;). On PHP, had to add the s modifier for multi line match:
/(?<!<!--)<p[^<]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*\[crayon.*?\].*?\[\/crayon\].*?<\/p>(?!(\s)?-->)/s
The following string was used for testing. 5 matches expected, 179 steps taken (the single regex from question took 285 steps):
<p>This is a <b>sentence</b> [crayon]...[/crayon] of words.</p>
<p class="large"> Paragraph with parameters [crayon]...[/crayon]</p>
<p>[crayon with-parameters=true]...[/crayon]</p>
<p>
Multiline paragraph [crayon]...[/crayon].
Lorem ipsum.
</p>
<p>...</p><p>[crayon]...[/crayon]</p>
<!-- <p> --> This is a <b>sentence</b> [crayon]...[/crayon] of words.<!-- </p> -->
<pizza>yummy</pizza>
Any improvement?
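For comparison, the first method from the question (match each <p> block, then look for [crayon in the capture) can be sketched in a few lines of Python. The function name paragraphs_with_crayon is hypothetical, and unlike the improved regex above this sketch makes no attempt to skip paragraphs inside HTML comments.

```python
import re

# Method 1 from the question: a <p> block with optional attributes, lazy body
P_BLOCK_RE = re.compile(r"<p(?:\s+[^>]*)?>(.*?)</p(?:\s+[^>]*)?>", re.DOTALL)


def paragraphs_with_crayon(html):
    """Return every <p>...</p> block whose body contains a [crayon tag."""
    return [
        match.group(0)
        for match in P_BLOCK_RE.finditer(html)
        if "[crayon" in match.group(1)
    ]
```

Because the opening-tag pattern requires either > or whitespace right after <p, a tag such as <pizza> is not mistaken for a paragraph.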

Writing Regex pattern for HTML tags

I'm very new to PHP writing and regular expressions. I need to write a Regex pattern that will allow me to "grab" the headlines in the following html tags:
<title>My news</title>
<h1>News</h1>
<h2 class=\"yiv1801001177first\">This is my first headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is another headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is the third headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is the last headline</h2> <p>This is a summary of a fascinating article.</p>
So I need a pattern to match all the <h2> tags. This is my first attempt at writing a pattern, and I'm seriously struggling...
/(<h+[2])>(.*?)\<\/h2>/ is what I've attempted. Help is much appreciated!
I'm not too familiar with PHP, but in cases like this it's usually easier to use an XML parser (which will automatically detect <h2> as well as <h2 class="whatever">) rather than a regex, to which you'll have to add a bunch of special cases. JavaScript, for example, has the XML DOM for exactly this purpose; I'd be surprised if PHP didn't have something similar.
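To make the parser suggestion concrete, here is a rough sketch using Python's standard html.parser module (Python rather than PHP, purely for illustration). The class name H2Collector and the helper h2_headlines are made up for this example; the parser itself takes care of attributes and tag case.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here
        if tag == "h2":
            self.in_h2 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Accumulate text, including text inside nested tags such as <b>
        if self.in_h2:
            self.headlines[-1] += data


def h2_headlines(html):
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return [headline.strip() for headline in parser.headlines]
```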
The easiest way to do it via regex is
#<h2\b[^>]*>(.*?)</h2>#is
This will match any h2 tag and capture its contents in backreference $1. I've used # as a regex delimiter to avoid escaping the / later on in the regex, and the is options to make the regex case-insensitive and to allow newlines within the tag's contents.
There are circumstances where this regex will fail, though, as pointed out correctly by others in this thread.
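The same pattern carries over to other regex flavours almost unchanged. As a sketch in Python (not PHP, so the # delimiters and trailing is options become flags), with the hypothetical helper name headlines:

```python
import re

# IGNORECASE and DOTALL play the role of the "i" and "s" options
H2_RE = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)


def headlines(html):
    """Return the contents of every <h2> element, attributes or not."""
    return [content.strip() for content in H2_RE.findall(html)]
```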
I have only checked this in RegexBuddy, where the following regex works:
<h2.*</h2>

How to use Perl Regex to detect <p> inside another <p>

I am trying to parse some "wrong HTML" in order to fix it using a Perl regex.
The wrong HTML is the following: <p>foo<p>bar</p>foo</p>
I would like the Perl regex to return: <p>foo<p>
I tried something like: '|(<p\b[^>]*>(?!</p>)*?<p[^>]*>)|'
with no success, because I cannot repeat (?!</p>)*?
Is there a way in a Perl regex to say "all characters except the following sequence" (in my case </p>)?
Try something like:
<p>(?:(?!</?p>).)*</p>(?!(?:(?!</?p>).)*(<p>|$))
A quick break down:
<p>(?:(?!</?p>).)*</p>
matches a <p> ... </p> that contains neither <p> nor </p>. And the part:
(?!(?:(?!</?p>).)*(<p>|$))
is "true" when, looking ahead ((?! ... )), there is no <p> or end of the input ((<p>|$)) that can be reached without any <p> or </p> in between ((?:(?!</?p>).)*).
A demo:
my $txt = "<p>aaa aa a</p> <p>foo <p>bar</p> foo</p> <p> bb <p>x</p> bb</p>";
while ($txt =~ m/(<p>(?:(?!<\/?p>).)*<\/p>)(?!(?:(?!<\/?p>).)*(<p>|$))/g) {
    print "Found: $1\n";
}
prints:
Found: <p>bar</p>
Found: <p>x</p>
Note that in the following string, this regex trickery matches only <p>baz</p>:
<p>foo <p>bar</p> <p>baz</p> foo</p>
<p>bar</p> is not matched! After replacing <p>baz</p>, you could do a 2nd run on the input, in which case <p>bar</p> will be matched.
I concur with Andy. Parsing nontrivial HTML with regexps is a world of pain.
Have a good look at HTML::TreeBuilder::XPath and HTML::DOM for making structural changes to HTML documents.
This regexp:
<p>(?:(?!</p>).)*?<p>
when matched with
<p>foo<p>bar</p>foo</p>
results in
<p>foo<p>
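The (?:(?!</p>).)*? construct is the usual way of saying "any characters, as long as the sequence </p> does not start here". A quick Python sketch of the same regexp (Python instead of Perl only for illustration; the helper name find_unclosed_p is invented) shows that it matches the broken input and ignores well-formed paragraphs:

```python
import re

# <p>, then characters that never begin a </p>, then a second <p>
NESTED_P_RE = re.compile(r"<p>(?:(?!</p>).)*?<p>", re.DOTALL)


def find_unclosed_p(html):
    """Return the first '<p>...<p>' run with no </p> in between, or None."""
    match = NESTED_P_RE.search(html)
    return match.group(0) if match else None
```

Like the original, this only recognizes a bare <p>; opening tags with attributes would need <p\b[^>]*> instead.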
If you're trying to validate HTML then consider a module like HTML::Tidy or HTML::Lint.
Perhaps Marpa::HTML would help you. You can read about some of its interesting abilities on the author's blog. The short of it is that the parser works with the interpreter (I am probably getting some of the semantics wrong) to figure out what should be present based on what CAN be present at a certain logical place in the code.
The examples shown there fix problems similar to the one you seem to be dealing with, and in a much more consistent way than regexes, which will inevitably suffer from edge cases.
Marpa::HTML comes with a command-line utility, built using the module, called html_fmt. It implements a parsing engine to fix and pretty-print HTML. Here is an example. If 'bad.html' contains <p>foo<p>bar</p>foo</p>, then html_fmt bad.html gives:
<!-- Following start tag is replacement for a missing one -->
<html>
<!-- Following start tag is replacement for a missing one -->
<head>
</head>
<!-- Preceding end tag is replacement for a missing one -->
<!-- Following start tag is replacement for a missing one -->
<body>
<p>
foo
</p>
<!-- Preceding end tag is replacement for a missing one -->
<p>
bar
</p>
foo
<!-- Next line is cruft -->
</p>
</body>
<!-- Preceding end tag is replacement for a missing one -->
</html>
<!-- Preceding end tag is replacement for a missing one -->