Looking for regex to find footer elements - regex

I would like to use regex to search for all instances of a footer in an EPUB, like the following sample:
<p class="calibre1">2 <> GENERAL INTRODUCTION </p>
of the more general format:
<p class="calibre1">[page number from 1-1000][" <>"][Title of section]</p>
My goal is to use calibre's regex search to find all instances of that footer and delete them, but I've tried these expressions and none of them finds even the example above:
<p class="calibre1">[0-9] <>[^>] </p>
<p class="calibre1">[0-9] <> [\w] </p>
and even the general:
<p class="calibre1">[\w--[\d_]]</p>
<p class="calibre1">[0-9] [.]</p>
<p class="calibre1">[0-9] *[.]</p>
<p class="calibre1">[0-9][*.]</p>
I'm new to regex and am pulling my hair out. Please help with my (mis)understanding.

This should work for what you want:
^<p[ \t]*class="calibre1">[0-9]+[^<]*<>[^<]*<[/]p>$

Please try this:
^<p class="calibre1">\d{1,4}.*</p>$
^ - Anchor to the start of the line
<p class="calibre1"> - Actual text to match
\d{1,4} - match 1 to 4 digits
.* - then zero or more characters
</p> - until the closing tag
$ - anchored to the end of the line
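For what it's worth, the attempts in the question most likely fail because a character class such as [0-9] or [^>] matches exactly one character unless a quantifier (+, * or {m,n}) follows it. Below is a minimal sketch of the find-and-delete step in Python, combining ideas from both answers. The sample text and the names FOOTER and strip_footers are made up for illustration, and Python's re module is used only to show the idea; calibre's own search dialog may behave slightly differently.

```python
import re

# Pattern adapted from the two answers above: 1 to 4 digits, the literal " <>",
# then any text without "<" up to the closing tag. The optional trailing newline
# is an assumption so that deleting the footer leaves no blank line behind.
FOOTER = re.compile(r'^<p class="calibre1">\d{1,4} <>[^<]*</p>\n?', re.MULTILINE)

# Hypothetical sample: one footer between two ordinary paragraphs.
sample = (
    '<p class="calibre1">Some body text.</p>\n'
    '<p class="calibre1">2 <> GENERAL INTRODUCTION </p>\n'
    '<p class="calibre1">More body text.</p>\n'
)


def strip_footers(html: str) -> str:
    """Delete every line that looks like a page footer."""
    return FOOTER.sub("", html)


cleaned = strip_footers(sample)
print(cleaned)
```

Matching " <>" explicitly, rather than only the digits followed by .*, should make it less likely that an ordinary paragraph starting with a number gets deleted.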

Related

Regex, How to extract a delimited string and containing some special words?

From the following html script:
<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[designation]
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
</p>
I want to extract the following part
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
I tried this regular expression :
<font.*?font>
This extracts two separate matches, but how do I specify that I want the one that contains []?
Thank you
The way with Html Agility Pack:
using HtmlAgilityPack;
...
string htmlText = @"<p style=""line-height:0;text-align:left"">
...";
HtmlDocument html = new HtmlDocument();
html.LoadHtml(htmlText);
HtmlNode doc = html.DocumentNode;
HtmlNodeCollection nodes = doc.SelectNodes("//font[.//text()[contains(substring-after(., '['), ']')]]");
if (nodes != null)
{
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.OuterHtml);
}
}
In general, you shouldn't use regexes for HTML; there are usually much better ways to do it. However, in some isolated cases, it works perfectly fine. Assuming this is one of those cases, here's how to do it with regex.
Making regexes is often easy when you think of it this way: write what you want to match, and then replace parts of it with regex as necessary.
We want to match
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
We don't care what face="AR BLANCA"> <span style="font-size:20pt;line-height:30px;">, désignation, and </span> are, so replace them with .*.
<font .*[.*].*</font>
We also have to make sure that you escape all the special characters, otherwise [.*] will be mistaken for a character class.
<font .*\[.*\].*</font>
We also want to match all characters, but most of the time a . only matches non-newline characters. [\S\s] is a character class that by definition matches all characters.
<font [\S\s]*\[[\S\s]*\][\S\s]*</font>
We have one last problem: this regex will match from the very first <font to the last </font>. With your HTML example, making the quantifier lazy will not help, because the match still starts at the first <font and keeps expanding until it reaches a [, so we need to do something else. The best way to do this that I know of is to use the trick explained here. So we replace each instance of [\S\s]* with ((?!</?font)[\S\s])*.
<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>
Here's an online demonstration of this regex.
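As a quick check of the final pattern, here is a small Python sketch that runs it against the HTML from the question. Python's re module is an assumption made for illustration (the question appears to be about .NET), and the variable names are invented.

```python
import re

# The HTML from the question, shortened to the two <font> blocks.
html = '''<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
</p>'''

# Each ((?!</?font)[\S\s])* consumes one character at a time, but only while
# that character does not begin a <font or </font tag, so a match cannot
# spill over from one <font> element into the next.
pattern = re.compile(
    r'<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>'
)

match = pattern.search(html)
block = match.group(0)
print(block)
```

The first <font> element contains no [ before its closing tag, so the attempt starting there fails and the search moves on to the second element.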

Change tag of html with regex when content has specific word in SublimeText

I have a html with:
<p class="s5">Chapter 1 – General Information</p>
<p class="s5">Section 1 – Example</p>
<p>Some text</p>
<p class="s5">Chapter 2 – Introduction</p>
and I want to replace every <p class="s5"> tag whose text starts with Chapter with <h1> ... </h1>.
How do I perform this with a regex substitution in Sublime Text?
You haven't indicated which language/tool you're using, so here's a generic solution:
Search: (?<=<p class="s5">)(Chapter[^<]*)
Replace: <h1>$1</h1>
Breakdown:
(?<=<p class="s5">) is a look behind (non-consuming assertion) for <p class="s5">
(Chapter[^<]*) is text starting with Chapter and everything up to the next <
If your tool doesn't understand look behinds, you can just consume and replace the preceding input instead:
Search: <p class="s5">(Chapter[^<]*)
Replace: <p class="s5"><h1>$1</h1>
Note that languages and tools vary in their back-reference syntax; the $1 may need to be \1 instead.
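As written, the search patterns above appear to leave the surrounding <p class="s5"> and </p> in place, so the heading ends up nested inside the paragraph. If the goal is to swap the tag itself, one option is to consume both the opening and the closing tag and rebuild the element. A minimal sketch in Python follows (an assumption for illustration; Sublime Text's replace field would typically use $1 rather than \1):

```python
import re

html = (
    '<p class="s5">Chapter 1 – General Information</p>\n'
    '<p class="s5">Section 1 – Example</p>\n'
    '<p>Some text</p>\n'
    '<p class="s5">Chapter 2 – Introduction</p>\n'
)

# Match the whole element, capturing only the text that starts with "Chapter".
# [^<]* stops at the next "<", so this assumes no nested tags inside the heading.
pattern = re.compile(r'<p class="s5">(Chapter[^<]*)</p>')

# Python's re.sub uses \1 for the first capture group.
converted = pattern.sub(r'<h1>\1</h1>', html)
print(converted)
```

Only the two Chapter lines change; the Section line and the plain paragraph are left as they were.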

regex to replace HTML surrounding tag

I'm trying to replace an HTML tag with another one using Notepad++ search and replace.
I would like this:
<strong style="font-size: 1em;"><br />some text</strong>
to become this
<h3>some text</h3>
So far I have this:
<strong style="font-size: 1em;"\s(.*?)><br />(.*?)</strong>
and I am not sure what to put in "Replace with". Is this OK:
<h3>$1</h3>
?
Thanks
Try this as the replacement pattern.
<h3>\2</h3>
You can reference capture groups (between parentheses) in the regex with \n, where n is the number of the group.
The regex for matching should be this:
<strong style="font-size: 1em;"\s?(.*?)><br />(.*?)</strong>
The \s should be optional, because your HTML has no whitespace between the closing quote and the >.
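To see why the optional \s? matters, here is a small Python sketch (Python is an assumption for illustration; Notepad++ uses its own regex engine, though it accepts \2 in the replacement as well). The variable names are invented.

```python
import re

source = '<strong style="font-size: 1em;"><br />some text</strong>'

# Original attempt: \s is mandatory, but the sample has ">" right after the
# closing quote, so this pattern finds nothing.
strict = re.search(r'<strong style="font-size: 1em;"\s(.*?)><br />(.*?)</strong>', source)

# Corrected pattern: \s? makes the whitespace optional.
pattern = re.compile(r'<strong style="font-size: 1em;"\s?(.*?)><br />(.*?)</strong>')

# Group 1 holds any extra attributes, group 2 the text after <br />,
# which is why the replacement refers to \2 rather than \1.
result = pattern.sub(r'<h3>\2</h3>', source)
print(result)
```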

How to match all text between two strings multiline

I'm trying to accomplish the same thing as seen here:
i.e. assuming you have a text like:
<p>something</p>
<!-- OPTIONAL -->
<p class="sdf"> some text</p>
<p> some other text</p>
<!-- OPTIONAL END -->
<p>The end</p>
What is the regex that would match:
<p class="sdf"> some text</p>
<p> some other text</p>
I've setup a live test here using:
<!-- OPTIONAL -->(.*?)<!-- OPTIONAL END -->
but it's not matching correctly. Also the accepted answer on the page didn't work for me. What am I missing?
Unfortunately, RegExr depends on the browser's JavaScript RegExp implementation, which at the time did not support the flag/modifier that you need (the s flag was only added to JavaScript in ES2018).
You are looking for the s (DotAll) modifier, which forces the dot . to match newline characters as well.
Live Demo on regular expressions 101
If you are using JavaScript, you can use this workaround:
/<!-- OPTIONAL -->([\S\s]*?)<!-- OPTIONAL END -->/
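The same behaviour can be reproduced outside the browser. The sketch below uses Python's re module (an assumption for illustration) to compare the original pattern, the DotAll flag and the [\S\s] workaround on the sample text.

```python
import re

text = '''<p>something</p>
<!-- OPTIONAL -->
<p class="sdf"> some text</p>
<p> some other text</p>
<!-- OPTIONAL END -->
<p>The end</p>'''

# Without DotAll, "." stops at the first newline, so nothing matches.
no_flag = re.search(r'<!-- OPTIONAL -->(.*?)<!-- OPTIONAL END -->', text)

# With re.DOTALL (the "s" modifier), "." also matches newlines.
with_flag = re.search(r'<!-- OPTIONAL -->(.*?)<!-- OPTIONAL END -->', text, re.DOTALL)

# The [\S\s] workaround needs no flag at all.
workaround = re.search(r'<!-- OPTIONAL -->([\S\s]*?)<!-- OPTIONAL END -->', text)

# Strip the newlines that sit right after and right before the two comments.
inner = with_flag.group(1).strip()
print(inner)
```

The captured group includes the line breaks next to the comment markers, hence the strip() call at the end.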

Regular Expression in dreamweaver

Can anyone help me turn this into a regular expression?
<a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
The alt tag will change, and so might the image, but
<a onclick="NavigateChat();" style="cursor:pointer;">
will always start the string, and
</a>
will always end it. How can I use a regex to find this?
Description
I'm not quite sure what you're looking to return, so this generic regular expression will:
find anchor tags
require the anchor tag to have an attribute onclick="navigatechat();"
require the anchor tag to have an attribute style="cursor:pointer;"
allow the attributes to be matched in any order
require the anchor tag's inner text to be only an image tag
capture the anchor tag's inner text tag in its entirety
avoid many of the edge cases which make pattern matching in HTML difficult
<a(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sonclick="NavigateChat\(\);")(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="cursor:pointer;")(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(<img\s.*?)\s*<\/a>
Example
Live Demo
Sample Text
<a onmouseover=' a=1; onclick="NavigateChat();" style="cursor:pointer;" href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href='http://InterestedURL.com' id='revSAR'><img src="YouShouldn'tFindMe.nope"></a>
<a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
Matches
Group 0 gets the entire matched anchor tag
Group 1 gets the inner text
[0][0] = <a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
[0][1] = <img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/>
Do you need to extract/capture certain pieces of info or just find the whole string?
My usual method for generalizing regexp is to start with the literal text and just replace elements with general placeholders...
<a onclick="NavigateChat\(\);" style="cursor:pointer;"><img src="[^"]+" width="\d+" height="\d+" border="\d+" alt="[^"]+"/></a>
This expression uses the character set [^"] which stands for "not a quote mark". If you just use .* as a wildcard, your regexp will fail if there is more than one tag present in your document. Regexps are "greedy" and would try to select ALL the text from the first tag through to the end of the last link.
Without a data sample, I can't test this for sure, but it should be close.
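A rough way to test this against the sample from the question is sketched below in Python (an assumption for illustration; Dreamweaver's find-and-replace uses a JavaScript-style engine, but this pattern uses nothing engine-specific). The second half illustrates the greediness point by putting two links on one line; the variable names are invented.

```python
import re

html = ('<a onclick="NavigateChat();" style="cursor:pointer;">'
        '<img src="images/online-chat.jpg" width="350" height="150" border="0" '
        'alt="Title Loans Novato - Online Chat"/></a>')

# The generalized pattern from the answer above, split over two lines.
pattern = re.compile(
    r'<a onclick="NavigateChat\(\);" style="cursor:pointer;">'
    r'<img src="[^"]+" width="\d+" height="\d+" border="\d+" alt="[^"]+"/></a>'
)
match = pattern.search(html)

# Two links back to back on one line: [^"]+ keeps them apart, while a greedy
# .* runs from the first <a to the very last </a>.
two_links = html + html
separate = pattern.findall(two_links)
greedy = re.findall(r'<a onclick="NavigateChat\(\);".*</a>', two_links)

print(len(separate), len(greedy))
```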