Regular Expression in dreamweaver - regex

Can anyone help me turn this into a regular expression?
<a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
The alt tag will change, and so might the image, but
<a onclick="NavigateChat();" style="cursor:pointer;">
will always start the string, and
</a>
will always end it. How can I use a regex to find this?

Description
I'm not quite sure what you're looking to return, so this generic regular expression will:
find anchor tags
require the anchor tag to have an attribute onclick="NavigateChat();"
require the anchor tag to have an attribute style="cursor:pointer;"
allow the attributes to be matched in any order
require the anchor tag's inner text to be only an image tag
capture the anchor tag's inner text in its entirety
avoid many of the edge cases which make pattern matching in HTML difficult
<a(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sonclick="NavigateChat\(\);")(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="cursor:pointer;")(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(<img\s.*?)\s*<\/a>
Example
Sample Text
<a onmouseover=' a=1; onclick="NavigateChat();" style="cursor:pointer;" href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href='http://InterestedURL.com' id='revSAR'><img src="YouShouldn'tFindMe.nope"></a>
<a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
Matches
Group 0 gets the entire matched anchor tag
Group 1 gets the inner text
[0][0] = <a onclick="NavigateChat();" style="cursor:pointer;"><img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/></a>
[0][1] = <img src="images/online-chat.jpg" width="350" height="150" border="0" alt="Title Loans Novato - Online Chat"/>

Do you need to extract/capture certain pieces of info or just find the whole string?
My usual method for generalizing regexp is to start with the literal text and just replace elements with general placeholders...
<a onclick="NavigateChat\(\);" style="cursor:pointer;"><img src="[^"]+" width="\d+" height="\d+" border="\d+" alt="[^"]+"/></a>
This expression uses the character set [^"] which stands for "not a quote mark". If you just use .* as a wildcard, your regexp will fail if there is more than one tag present in your document. Quantifiers such as * are "greedy" by default, so .* would try to select ALL the text from the first tag through to the end of the last link.
Without a data sample, I can't test this for sure, but it should be close.
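To illustrate the greedy-wildcard point above, here is a minimal Python sketch. The two-link sample string and the second image name (images/other.jpg) are made up for the demonstration; only the first link comes from the question.

```python
import re

# Two chat links on one line; the second one is a hypothetical extra link.
html = (
    '<a onclick="NavigateChat();" style="cursor:pointer;">'
    '<img src="images/online-chat.jpg" width="350" height="150" border="0" '
    'alt="Title Loans Novato - Online Chat"/></a>'
    '<p>text</p>'
    '<a onclick="NavigateChat();" style="cursor:pointer;">'
    '<img src="images/other.jpg" width="100" height="50" border="0" alt="Second"/></a>'
)

# Pattern built from the literal text, with [^"]+ and \d+ as placeholders.
specific = re.compile(
    r'<a onclick="NavigateChat\(\);" style="cursor:pointer;">'
    r'<img src="[^"]+" width="\d+" height="\d+" border="\d+" alt="[^"]+"/></a>'
)

# Same start and end, but a greedy .* in the middle.
greedy = re.compile(r'<a onclick="NavigateChat\(\);" style="cursor:pointer;">.*</a>')

specific_matches = specific.findall(html)  # one entry per link
greedy_matches = greedy.findall(html)      # a single match swallowing both links

print(len(specific_matches), len(greedy_matches))
```

The negated character class cannot cross a quote mark, so each match stays inside one tag, while .* runs on to the last </a> it can find.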

Related

Looking for regex to find footer elements

I would like to use regex to search for all instances of a footer in an EPUB like the following sample:
<p class="calibre1">2 <> GENERAL INTRODUCTION </p>
of the more general format:
<p class="calibre1">[page number from 1-1000][" <>"][Title of section]</p>
My goal is to use calibre's regex to find all instances of that footer and delete them, but I've tried these expressions and none of them finds even the one example above:
<p class="calibre1">[0-9] <>[^>] </p>
<p class="calibre1">[0-9] <> [\w] </p>
and even the general:
<p class="calibre1">[\w--[\d_]]</p>
<p class="calibre1">[0-9] [.]</p>
<p class="calibre1">[0-9] *[.]</p>
<p class="calibre1">[0-9][*.]</p>
I'm new to regex and am pulling my hair out. Please help with my (mis)understanding.
This should work for what you want:
^<p[ \t]*class="calibre1">[0-9]+[^<]*<>[^<]*<[/]p>$
Please try this:
^<p class="calibre1">\d{1,4}.*</p>$
^ - Anchor to the start of the line
<p class="calibre1"> - Actual text to match
\d{1,4} - match 1 to 4 digits
.* - then zero or more characters
</p> - until the closing tag
$ - anchored to the end of the line
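The two suggested patterns differ in how strict they are. The sketch below compares them in Python's re module rather than in calibre itself (that the two engines behave the same here is an assumption), and the second and third sample lines are invented for the comparison.

```python
import re

# First line is the footer from the question; the other two are made-up body text.
text = (
    '<p class="calibre1">2 <> GENERAL INTRODUCTION </p>\n'
    '<p class="calibre1">Body text that should stay.</p>\n'
    '<p class="calibre1">1984 was a cold year.</p>'
)

# First answer: digits, then a literal "<>", then the title.
strict = re.compile(r'^<p[ \t]*class="calibre1">[0-9]+[^<]*<>[^<]*<[/]p>$', re.MULTILINE)

# Second answer: 1 to 4 digits followed by anything up to </p>.
loose = re.compile(r'^<p class="calibre1">\d{1,4}.*</p>$', re.MULTILINE)

strict_hits = strict.findall(text)
loose_hits = loose.findall(text)

print(strict_hits)
print(loose_hits)
```

The looser pattern also matches any paragraph that merely begins with a number, so it is worth checking the matches before deleting them.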

Capture specific first matches in regex

I have this text and want to capture each match of the letter 'ñ' under the html href attribute. I want it to match the 'ñ' in both niño.html and niña.html, but not the ones in Niño and Niña:
<a href='niño.html'>Niño</a> <a href='niña.html'>Niña</a>
I tried this but it also matches Niño:
ñ(.*?\.html'>)+?
When replacing with n\1, it gives:
<a href='nino.html'>Nino</a> <a href='niña.html'>Niña</a>
What I would want the text to look like is:
<a href='nino.html'>Niño</a> <a href='nina.html'>Niña</a>
How can I do this?
Try this, where the part between the ñ and .html is not allowed to contain a single quote:
ñ([^']*?\.html'>)+?
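As a quick check of the pattern above, here is a small Python sketch using the sample text from the question. The question does not say which tool is being used, so Python's re.sub stands in for it here.

```python
import re

text = "<a href='niño.html'>Niño</a> <a href='niña.html'>Niña</a>"

# [^']*? cannot run past the closing quote of the href value, so an ñ in the
# link text can never reach a later ".html'>" and is left untouched.
pattern = re.compile(r"ñ([^']*?\.html'>)+?")

result = pattern.sub(r"n\1", text)
print(result)
```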

Regex, How to extract a delimited string and containing some special words?

From the following html script:
<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[designation]
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
</p>
I want to extract the following part
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
I tried this regular expression :
<font.*?font>
This extracts two separate matches, but how do I specify that I want the one that contains []?
Thank you
The way with Html Agility Pack:
using System;
using HtmlAgilityPack;
...
string htmlText = @"<p style=""line-height:0;text-align:left"">
...";
HtmlDocument html = new HtmlDocument();
html.LoadHtml(htmlText);
HtmlNode doc = html.DocumentNode;
// Select every <font> element with a descendant text node that has a ']' somewhere after a '['.
HtmlNodeCollection nodes = doc.SelectNodes("//font[.//text()[contains(substring-after(., '['), ']')]]");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine(node.OuterHtml);
    }
}
In general, you shouldn't use regexes for HTML; there are usually much better ways to do it. However, in some isolated cases, it works perfectly fine. Assuming this is one of those cases, here's how to do it with regex.
Making regexes is often easy when you think of it this way: write what you want to match, and then replace parts of it with regex as necessary.
We want to match
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
We don't care what face="AR BLANCA"> <span style="font-size:20pt;line-height:30px;">, désignation, and </span> are, so replace them with .*.
<font .*[.*].*</font>
We also have to make sure that you escape all the special characters, otherwise [.*] will be mistaken for a character class.
<font .*\[.*\].*</font>
We also want to match all characters, but most of the time a . only matches non-newline characters. [\S\s] is a character class that by definition matches all characters.
<font [\S\s]*\[[\S\s]*\][\S\s]*</font>
We finally have one last problem: this regex will match from the very first <font to the last </font>. With your HTML example, making the quantifier lazy will not help, so we need to do something else. The best way that I know of is to put a negative lookahead in front of every character matched, so that the match can never run across another font tag. So we replace each instance of [\S\s]* with ((?!</?font)[\S\s])*.
<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>
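For comparison, here is a minimal Python sketch that runs the final pattern and a lazy variant against the HTML from the question. The lazy pattern is not from the answer; it is included only to show why laziness alone does not help.

```python
import re

html = """<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[designation]
</span>
</font>
</p>"""

# Lazy quantifiers still begin at the first <font and stretch until a '[' turns up.
lazy = re.search(r'<font [\S\s]*?\[[\S\s]*?\][\S\s]*?</font>', html)

# Each character is consumed only if it does not start <font or </font.
guarded = re.search(
    r'<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>',
    html,
)

print(lazy.group(0).splitlines()[0])
print(guarded.group(0).splitlines()[0])
```

The lazy version starts at the Arial font tag, while the guarded version can only match inside a single font element and so returns the block containing the brackets.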

Reg exp: string NOT in pattern

I'm having trouble constructing a reg exp. I think I should use lookahead/lookbehind, but I can't get it to work.
I want to make a reg-exp that catches all HTML tags that do NOT contain a string ('rabbit').
For example, the following tags should be matched
<a XXX> <span yyy> </div x zz> </li qwerty=ab cd> <div hello=stackoverflow>
But not the following
<a XXrabbitX> <span yyyrabbit> </div xrabbitzz> </li rabbit=abcd hippo=9876> <div hello=rabbit>
(My next step is to make a substitution so that the word rabbit enters the tags, but that will hopefully come easy.)
(I use PHP5-preg_replace.)
Thanks.
I guess you're matching the HTML tags with a regex something like this:
/<[^>]*>/
You can add a negative look-ahead assertion in there to assert that "rabbit" cannot be found in the tag:
/<(?![^>]*rabbit)[^>]*>/
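Here is a small sketch of that lookahead against the sample tags from the question. It uses Python's re module instead of PHP's preg functions, on the assumption that this pattern behaves the same in both.

```python
import re

should_match = "<a XXX> <span yyy> </div x zz> </li qwerty=ab cd> <div hello=stackoverflow>"
should_not_match = (
    "<a XXrabbitX> <span yyyrabbit> </div xrabbitzz> "
    "</li rabbit=abcd hippo=9876> <div hello=rabbit>"
)

# The lookahead scans only up to the next '>' because [^>]* cannot cross it,
# so a "rabbit" in a later tag does not block the current one.
tag_without_rabbit = re.compile(r'<(?![^>]*rabbit)[^>]*>')

matched = tag_without_rabbit.findall(should_match)
rejected = tag_without_rabbit.findall(should_not_match)

print(matched)
print(rejected)
```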

How to match second <a> tag in this string

I have a HTML fragment which contains two anchor tags in various parts of the HTML.
<span id="ctl00_PlaceHolderTitleBreadcrumb_ContentMap">
<span><a class="ms-sitemapdirectional" href="/">My Site</a></span>
<span> > </span>
<span><a class="ms-sitemapdirectional" href="/Lists/Announcements/AllItems.aspx">Announcements</a></span>
<span> > </span>
<span class="ms-sitemapdirectional">Settings</span>
</span>
I'm looking to write a regular expression that will return the second anchor tag, which has 'Announcements' as its text. In trying to write an expression, I keep getting both anchor tags returned, but I'm only interested in the second tag.
Is it possible to match the second tag only?
EDIT:
I will always know that I'm looking for an anchor tag which has 'Announcements' in its text, if that helps.
Parse the fragment into a DOM. Use XPath to issue:
(//a)[2]
Done.
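A rough sketch of the DOM approach in Python follows. It assumes the fragment is well-formed enough to be read by the standard library's XML parser, which happens to hold for this sample but not for HTML in general. ElementTree does not accept the (//a)[2] syntax, so the sketch indexes the result list instead.

```python
import xml.etree.ElementTree as ET

fragment = """<span id="ctl00_PlaceHolderTitleBreadcrumb_ContentMap">
<span><a class="ms-sitemapdirectional" href="/">My Site</a></span>
<span> > </span>
<span><a class="ms-sitemapdirectional" href="/Lists/Announcements/AllItems.aspx">Announcements</a></span>
<span> > </span>
<span class="ms-sitemapdirectional">Settings</span>
</span>"""

root = ET.fromstring(fragment)

# All anchors in document order; index 1 is the second one.
anchors = root.findall(".//a")
second = anchors[1]

# Alternatively, select by the known link text rather than by position.
by_text = [a for a in root.iter("a") if a.text == "Announcements"]

print(second.text, second.get("href"))
```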
Something like:
/<a.+?>[^<>]*Announcements[^<>]*<\/a>/
PS. Regular expressions are the wrong tool for parsing HTML.
/(<a.*?<\/a>).*?(<a.*?<\/a>)/
$1 matches the first tag, $2 matches the second
You don't have to use a complicated regular expression for this if you don't want to. Since you want to get anchors, and anchors usually have an ending tag </a>, you can use your favourite language and split each line on </a>.
eg pseudocode
for each line in htmlfile
do
    var=split line on </a>
    for each item in var
    do
        if item has "Announcement" then
            print "found"
        end if
    done
done
<?php
$string = '<span id="ctl00_PlaceHolderTitleBreadcrumb_ContentMap"><span><a class="ms-sitemapdirectional" href="/">My Site</a></span><span> > </span><span><a class="ms-sitemapdirectional" href="/Lists/Announcements/AllItems.aspx">Announcements</a></span><span> > </span><span class="ms-sitemapdirectional">Settings</span></span>';
$dom = new DOMDocument();
$dom->loadHTML($string);
$anchors = $dom->getElementsByTagName('a');
// Make sure a second anchor actually exists before using it.
if ( $anchors->length > 1 ) {
    $secondAnchor = $anchors->item(1);
    echo innerHTML($secondAnchor->parentNode);
}
// Serialize the children of a node into an HTML string.
function innerHTML($node){
    $doc = new DOMDocument();
    foreach ($node->childNodes as $child)
        $doc->appendChild($doc->importNode($child, true));
    return $doc->saveHTML();
}
If you know the exact text of the element, and you know it's the last element of its kind in the fragment, you have more than enough information to match it with a regex. I suspect you're using a regex like this:
/<a\s+.*>Announcements<\/a>/s
...and the .* is matching everything between the <a of the first anchor tag and the >Announcements</a> of the second one. Switching to a non-greedy quantifier:
/<a\s+.*?>Announcements<\/a>/s
...doesn't help; a reluctant quantifier stops matching as soon as possible, but the problem here is that it starts matching too soon. You need to replace the .* with something more specific, something that can only match whatever comes between the opening <a and closing > of a single tag:
/<a\s+[^<>]+>Announcements<\/a>/
Now, when it reaches the end of the first <a> tag and doesn't see Announcements</a> it will abort that match attempt, move along and start fresh at the second <a> tag.
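The difference between the reluctant quantifier and the negated character class can be seen in a short Python sketch. The input is the single-line version of the fragment, and re.DOTALL plays the role of the /s flag.

```python
import re

html = (
    '<span id="ctl00_PlaceHolderTitleBreadcrumb_ContentMap">'
    '<span><a class="ms-sitemapdirectional" href="/">My Site</a></span>'
    '<span> > </span>'
    '<span><a class="ms-sitemapdirectional" href="/Lists/Announcements/AllItems.aspx">'
    'Announcements</a></span>'
    '<span> > </span>'
    '<span class="ms-sitemapdirectional">Settings</span></span>'
)

# Reluctant .*? still begins at the first <a and grows until the text fits.
reluctant = re.search(r'<a\s+.*?>Announcements</a>', html, re.DOTALL)

# [^<>]+ cannot leave the opening tag, so the first anchor is rejected outright.
specific = re.search(r'<a\s+[^<>]+>Announcements</a>', html)

print(reluctant.group(0))
print(specific.group(0))
```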