I'd like to replace text containing angle brackets, as follows:
<p> <b id="docs-guid-785896d2-1" >Choose </span> <span style="font-size: 15px; ">barren</span> <span > passage.</span></b> </p>\r\n', <b id="docs-guid-785896d2-6" > <span >empty</span></b> </p>\r\n\r\n<div> </div>\r\n', '<p> <b id="docs-guid-785896d2-665" > <span >wheat</span></b> </p>\r\n'
All the data is on one line.
I tried to remove the b tag, e.g. "<b id="docs-guid-785896d2-1" > xxxx </b>" => xxxx
I used "<b id="docs-guid-(.*)" >(.*)</b>" with "\2" as the replacement to remove that tag, but only one match was found (out of all 3).
Could somebody help me find and replace all 3 pairs?
Thanks in advance.
Use the lazy version of (.*) by adding a question mark:
<b id="docs-guid-(.*?)" >(.*?)</b>
                    ^       ^
Otherwise you'll match too much and the replace will remove more than necessary.
Or better yet, use a negated character class, which is somewhat more efficient:
<b id="docs-guid-[^"]+" >(.*?)</b>
Here, replace with $1, since the pattern now has only one capture group.
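To see the difference between the greedy and the lazy pattern, here is a minimal sketch in Python's re module. The input is a shortened, hypothetical version of the line from the question, and Python writes the replacement as \1 where the answer writes $1.

```python
import re

# Shortened, hypothetical version of the one-line input from the question.
html = (
    '<p> <b id="docs-guid-785896d2-1" > <span >barren</span></b> </p>'
    '<p> <b id="docs-guid-785896d2-6" > <span >empty</span></b> </p>'
    '<p> <b id="docs-guid-785896d2-665" > <span >wheat</span></b> </p>'
)

# Greedy: the first (.*) runs as far as it can, so the three tags
# are swallowed by a single match.
greedy = re.compile(r'<b id="docs-guid-(.*)" >(.*)</b>')
greedy_matches = greedy.findall(html)

# Negated class for the id, lazy quantifier for the content:
# each <b>...</b> pair is matched on its own.
lazy = re.compile(r'<b id="docs-guid-[^"]+" >(.*?)</b>')
lazy_matches = lazy.findall(html)

# Keep only the content of each <b> element.
stripped = lazy.sub(r'\1', html)
print(len(greedy_matches), len(lazy_matches))
print(stripped)
```

The greedy pattern reports a single match while the lazy one reports three, which is the behaviour described in the question.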
I would like to use a regex to find every instance of a footer in an EPUB, like the following sample:
<p class="calibre1">2 <> GENERAL INTRODUCTION </p>
which has the more general format:
<p class="calibre1">[page number from 1-1000][" <>"][Title of section]</p>
My goal is to use calibre's regex search to find all instances of that footer and delete them, but none of the expressions I've tried finds even the one example above:
<p class="calibre1">[0-9] <>[^>] </p>
<p class="calibre1">[0-9] <> [\w] </p>
and even the general:
<p class="calibre1">[\w--[\d_]]</p>
<p class="calibre1">[0-9] [.]</p>
<p class="calibre1">[0-9] *[.]</p>
<p class="calibre1">[0-9][*.]</p>
I'm new to regex and am pulling my hair out. Please help with my (mis)understanding.
This should work for what you want:
^<p[ \t]*class="calibre1">[0-9]+[^<]*<>[^<]*<[/]p>$
Please try this:
^<p class="calibre1">\d{1,4}.*</p>$
^ - Anchor to the start of the line
<p class="calibre1"> - Actual text to match
\d{1,4} - match 1 to 4 digits
.* - then zero or more characters
</p> - until the closing tag
$ - anchored to the end of the line
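As a rough check of the idea outside calibre, here is a small sketch in Python's re module. The three-line page is made up for illustration, and the pattern is a slightly stricter variant of the ones above (it also requires the " <>" separator). It also shows why the first attempt in the question finds nothing: [0-9] and [^>] each match exactly one character unless a quantifier such as + or * follows them.

```python
import re

# Hypothetical three-line page; only the middle line is a footer.
page = '\n'.join([
    '<p class="calibre1">Some body text.</p>',
    '<p class="calibre1">2 <> GENERAL INTRODUCTION </p>',
    '<p class="calibre1">More body text.</p>',
])

# One of the attempts from the question: without quantifiers, [^>] stands
# for a single character, so the section title can never be matched.
failed = re.search(r'<p class="calibre1">[0-9] <>[^>] </p>', page)

# 1 to 4 digits, the " <>" separator, then anything up to the closing tag.
# re.MULTILINE makes ^ match at the start of every line.
footer = re.compile(r'^<p class="calibre1">\d{1,4} <>[^<]*</p>\n?', re.MULTILINE)
cleaned = footer.sub('', page)
print(cleaned)
```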
I am currently in a bit of a pickle with JS Regex. From the following code, we need to extract the content inside the div or span:
<span class="code">
!(true ^ false)
</span>
<span class="code">
(true ^ false)
</span>
So we need to match both !(true ^ false) and (true ^ false)
I came up with the following regex: /<(div|span) class="code">([\s\S]+)<\/(div|span)>/im. This works when the text to be matched contains only one div or span. However, in the situation outlined above, the match is:
!(true ^ false)
</span>
<span class="code">
(true ^ false)
So basically it pairs the first opening tag with the last closing tag. How do I fix this?
You can fix it by pairing <div> with </div> and <span> with </span>, and by making the quantifier lazy with ?.
Regex: <(div|span) class="code">([\s\S]+?)<\/\1>
Explanation:
The backreference \1 refers to the first capture group, so the closing tag must have the same name as the opening tag.
The ? after the + makes the quantifier lazy, so it matches as little as possible and stops at the first closing tag.
All you really need is to put a ? after your + to make the match non-greedy, and then add the g modifier so the search continues after the first match:
<(div|span) class="code">\s*([\s\S]+?)<\/(div|span)>
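Both fixes can be combined. The sketch below uses Python's re module rather than JavaScript, so re.IGNORECASE stands in for the i flag and finditer plays the role of the g modifier. The \s* on either side of the second group is an extra assumption on my part, added only to trim the surrounding newlines.

```python
import re

html = '''<span class="code">
!(true ^ false)
</span>
<span class="code">
(true ^ false)
</span>'''

# Lazy quantifier plus a backreference: \1 forces the closing tag to have
# the same name (div or span) as the opening tag.
pattern = re.compile(
    r'<(div|span) class="code">\s*([\s\S]+?)\s*</\1>',
    re.IGNORECASE,
)

# finditer walks over every non-overlapping match, like /g in JavaScript.
contents = [m.group(2) for m in pattern.finditer(html)]
print(contents)
```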
Given the following HTML:
<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[designation]
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
</p>
I want to extract the following part:
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
I tried this regular expression:
<font.*?font>
It extracts two separate matches, but how do I specify that I only want the one that contains []?
Thank you
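Here is a way to do it with Html Agility Pack:
using System;
using HtmlAgilityPack;
...
// Verbatim string literal: note the @ prefix and the doubled quotes.
string htmlText = @"<p style=""line-height:0;text-align:left"">
...";
HtmlDocument html = new HtmlDocument();
html.LoadHtml(htmlText);
HtmlNode doc = html.DocumentNode;
// Select every <font> element that has a descendant text node containing '[' followed later by ']'.
HtmlNodeCollection nodes = doc.SelectNodes("//font[.//text()[contains(substring-after(., '['), ']')]]");
// SelectNodes returns null when nothing matches.
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine(node.OuterHtml);
    }
}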
In general, you shouldn't use regexes for HTML—there are generally many much better ways to do it. However, in some isolated cases, it works perfectly fine. Assuming this is one of those cases, here's how to do it with regex.
Making regexes is often easy when you think of it this way: write what you want to match, and then replace parts of it with regex as necessary.
We want to match
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
We don't care what face="AR BLANCA"> <span style="font-size:20pt;line-height:30px;">, désignation, and </span> are, so replace them with .*.
<font .*[.*].*</font>
We also have to make sure that we escape all the special characters, otherwise [.*] will be mistaken for a character class.
<font .*\[.*\].*</font>
We also want to match all characters, but most of the time a . only matches non-newline characters. [\S\s] is a character class that by definition matches all characters.
<font [\S\s]*\[[\S\s]*\][\S\s]*</font>
We finally have one last problem: this regex will match from the very first <font to the last </font>. With your HTML example, making the quantifiers lazy will not help, so we need to do something else. The best way I know of is to put a negative lookahead in front of each character, so that the match can never run past another font tag (a trick sometimes called a tempered greedy token). So we replace each instance of [\S\s]* with ((?!</?font)[\S\s])*.
<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>
Here's an online demonstration of this regex.
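For anyone who wants to try this locally, here is a minimal sketch in Python's re module using the HTML from the question. The only change to the pattern is that the groups are written as non-capturing (?:...), which is my own choice and does not affect what is matched.

```python
import re

html = '''<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[designation]
</span>
</font>
</p>'''

# Naive version: runs from the first <font to the last </font>.
naive = re.search(r'<font [\S\s]*\[[\S\s]*\][\S\s]*</font>', html)

# Tempered version: every character is consumed only if no opening or
# closing font tag starts at that position.
step = r'(?:(?!</?font)[\S\s])*'
tempered = re.search(r'<font ' + step + r'\[' + step + r'\]' + step + r'</font>', html)

print(tempered.group(0))
```

The naive match still contains the Arial block, while the tempered match is limited to the font element that holds the bracketed text.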
I'm trying to replace one HTML tag with another using Notepad++ search and replace.
I would like this:
<strong style="font-size: 1em;"><br />some text</strong>
to become this:
<h3>some text</h3>
So far I have this:
<strong style="font-size: 1em;"\s(.*?)><br />(.*?)</strong>
and I'm not sure what to put in "Replace with". Is this OK?
<h3>$1</h3>
Thanks
Try this as the replacement pattern.
<h3>\2</h3>
You can reference capture groups (the parts between parentheses) in the regex with \n, where n is the number of the group.
The regex for matching should be this:
<strong style="font-size: 1em;"\s?(.*?)><br />(.*?)</strong>
The \s should be optional, because your HTML has no whitespace at that position.
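The two answers can be checked together with a short sketch in Python's re module, which also accepts \2 in the replacement. It is only meant to illustrate the group numbering and the optional \s, not Notepad++'s own engine.

```python
import re

source = '<strong style="font-size: 1em;"><br />some text</strong>'

# Pattern from the question: the mandatory \s never matches here,
# because '>' follows the closing quote directly.
strict = re.compile(r'<strong style="font-size: 1em;"\s(.*?)><br />(.*?)</strong>')
unchanged = strict.sub(r'<h3>\2</h3>', source)

# Corrected pattern: \s? makes the whitespace optional.
# Group 1 is any extra attribute text, group 2 is the text we want to keep.
relaxed = re.compile(r'<strong style="font-size: 1em;"\s?(.*?)><br />(.*?)</strong>')
result = relaxed.sub(r'<h3>\2</h3>', source)
print(result)
```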
Can anyone help me turn this into a regular expression?
<a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
The alt text will change, and so might the image, but
<a onclick="NavigateChat();" style="cursor:pointer;">
will always start the string, and
</a>
will always end it. How can I use a regex to find this?
Description
I'm not quite sure what you're looking to return, so this generic regular expression will:
find anchor tags
require the anchor tag to have an attribute onclick="NavigateChat();"
require the anchor tag to have an attribute style="cursor:pointer;"
allow the attributes to be matched in any order
require the anchor tag's inner text to be only an image tag
capture the anchor tag's inner text in its entirety
avoid many of the edge cases which make pattern matching in HTML difficult
<a(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sonclick="NavigateChat\(\);")(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="cursor:pointer;")(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(<img\s.*?)\s*<\/a>
Example
Sample Text
<a onmouseover=' a=1; onclick="NavigateChat();" style="cursor:pointer;" href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href='http://InterestedURL.com' id='revSAR'><img src="YouShouldn'tFindMe.nope"></a>
<a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
Matches
Group 0 gets the entire matched anchor tag
Group 1 gets the inner text
[0][0] = <a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
[0][1] = <img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/>
Do you need to extract/capture certain pieces of info or just find the whole string?
My usual method for generalizing regexp is to start with the literal text and just replace elements with general placeholders...
<a onclick="NavigateChat\(\);" style="cursor:pointer;"><img src="[^"]+" width="\d+" height="\d+" border="\d+" alt="[^"]+"/></a>
This expression uses the character set [^"], which stands for "not a quote mark". If you just use .* as a wildcard, your regexp will fail if there is more than one tag present in your document. Quantifiers are greedy by default, so .* would try to select ALL the text from the first tag through to the end of the last link.
Without a data sample, I can't test this for sure, but it should be close.
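Here is a quick way to try the expression, sketched in Python's re module. The surrounding paragraph and the second link are invented test data, not part of the original page.

```python
import re

# The anchor from the question, surrounded by made-up markup.
target = (
    '<a onclick="NavigateChat();" style="cursor:pointer;">'
    '<img src="images/online-chat.jpg" width="350" height="150" '
    'border="0" alt="Title Loans Novato - Online Chat"/></a>'
)
html = '<p>intro</p>' + target + '<a href="other.html">other</a>'

# Literal text with the variable parts generalised:
# [^"]+ for attribute values, \d+ for the numeric ones.
pattern = re.compile(
    r'<a onclick="NavigateChat\(\);" style="cursor:pointer;">'
    r'<img src="[^"]+" width="\d+" height="\d+" border="\d+" alt="[^"]+"/></a>'
)

match = pattern.search(html)
print(match.group(0) if match else 'no match')
```

Because [^"]+ cannot cross a quote mark, the match stops at the first </a> instead of running on to the second link.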