Regex to match HTML with or without link - regex

I would like to be able to get "Target" out of this block of HTML when it appears in a page:
<h3>
<a href="http://link"> Target
</a> </h3>
I can count on the spacing being reliably there. What I can't count on is that "Target" will always be included in an anchor tag. Sometimes, it looks like this:
<h3>
Target
</h3>
I can match the first version and extract "Target" pretty easily with this regex:
/<h3>\s+<a href=.*>\s+(.*)\s+<\/a>\s+<\/h3>/
But I'm struggling to write one that will match both. Any ideas?

Don't use regular expressions to parse HTML. It is more painful than it is worth in most cases. Use a library designed to parse HTML.
#!/usr/bin/perl
use v5.16;
use strict;
use warnings;
use HTML::TreeBuilder;
my $data = qq{<body><h3>
<a href="http://link"> Target
</a> </h3></body>
};
my $otherdata = qq{<body><h3>
Target
</h3></body>
};
my $t = HTML::TreeBuilder->new_from_content($data);
say $t->look_down(_tag => "h3")->as_text();
$t = HTML::TreeBuilder->new_from_content($otherdata);
say $t->look_down(_tag => "h3")->as_text();

Just to put my two cents in, why not use an XPath query with a decent DOM library?
//html/body/h3//text()[contains(.,'Target')]
The actual query may vary depending upon your html structure.

Try this one as a regex:
<h3>\s+(?:<a href=[^>]*>\s+)?(.*?)\s+(?:<\/a>\s+)?<\/h3>
It should match both your cases.
Even though this is not a recommended way to search html, if this is what you want to try, I won't stop you.
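A quick way to check the optional-anchor idea is to run it against both samples from the question. The sketch below uses Python's re module and a variant of the pattern in which each optional group carries its own trailing whitespace, so nothing extra is required when the anchor is missing. The names H3_TEXT and extract are made up for illustration.

```python
import re

# Each optional part includes its own trailing whitespace, so the pattern
# does not demand extra whitespace when the <a> tag is absent.
H3_TEXT = re.compile(r"<h3>\s+(?:<a href=[^>]*>\s+)?(.*?)\s+(?:</a>\s+)?</h3>")

with_link = '<h3>\n<a href="http://link"> Target\n</a> </h3>'
without_link = "<h3>\nTarget\n</h3>"


def extract(html):
    """Return the captured heading text, or None if the pattern does not match."""
    m = H3_TEXT.search(html)
    return m.group(1) if m else None


print(extract(with_link))
print(extract(without_link))
```

This only covers the two layouts shown in the question; attributes on the h3, extra nested tags, or text spanning several lines would need a real parser.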

Related

Reg Exp: Get string only if it is not between a tags

I am doing a search and replace of some terms, adding a link to these words. If these words are already part of another link, I should skip the replacement (otherwise I would end up with <a href...> <a href ...> word </a> </a>, which is something I want to avoid).
I don't know if this is possible, so I'd like to know that and, if it is, get any hint. I am kind of lost. So far, I am only able to match the words that are part of a link, but not those that are outside one.
Thanks!
You can do something like this:
$urls = array('word1'=> 'http://urlfor.word1.com',
'word2'=> 'http://urlfor.word2.com',
'word3'=> 'http://urlfor.word3.com');
$pattern = '~<(?:a\s.*?</a>|!--.*?(?:-->|$)|[^>]+>)(*SKIP)(*FAIL)|\b(?:word1|word2|word3)\b~sD';
$result = preg_replace_callback($pattern, function($m) use ($urls) {
return '<a href="' . $urls[$m[0]] . '">' . $m[0] . '</a>'; },
$html);
$urls is an associative array where the keys are the words and the values are the corresponding URLs.
The pattern uses the (*SKIP)(*FAIL) trick to skip parts that are already between link tags, inside a tag, or in an HTML comment. (Note that you can easily extend the pattern to skip script, style and CDATA content, or to deal with unclosed <a> tags.)
This worked:
~<(?:a\s.*?</a>|[^>]+>)(*SKIP)(*FAIL)|\b(?:ultrices)\b~ig
adding g to get all the matches and not only the first one.
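For engines without (*SKIP)(*FAIL), the same effect can be approximated by matching the parts to skip in the same alternation and letting the callback return them unchanged. Here is a minimal sketch in Python, whose re module has no backtracking control verbs. The urls dictionary mirrors the one above, while link_word and the sample html string are invented for the demonstration.

```python
import re

urls = {
    "word1": "http://urlfor.word1.com",
    "word2": "http://urlfor.word2.com",
}

# Alternatives 1-3 consume existing links, comments and tags; only the
# last alternative captures a word (group 1) that should be linked.
pattern = re.compile(
    r"<a\s.*?</a>|<!--.*?-->|<[^>]+>|\b(" + "|".join(map(re.escape, urls)) + r")\b",
    re.S,
)


def link_word(m):
    """Wrap a bare word in a link; return skipped markup unchanged."""
    word = m.group(1)
    if word is None:
        return m.group(0)
    return '<a href="{}">{}</a>'.format(urls[word], word)


html = '<p title="word1">word1 and <a href="/x">word2</a> <!-- word1 --></p>'
result = pattern.sub(link_word, html)
print(result)
```

Only the bare word1 in the paragraph text gets wrapped. The occurrences inside the title attribute, the existing link, and the comment are consumed by the first three alternatives and passed through as they are.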

Regex select text and nearest wrapper

I have this text
<div>another words</div>
<div>
some text here
</div>
I want to get <div> element which contains 'text' word. result's here:
<div>
some text here
</div>
I can do it like this:
<div>.*text.*<\/div>
but it selects all text.
Try
<div>[^<]*text[^<]*<\/div>
To not include tags in the inner part of the match.
Also, regexp is not an ideal tool for parsing HTML. Consider whether your use case is better served by "proper" HTML parsing tools.
Edit:
If you have nested tags you are definitely leaving the area where regexp is a suitable tool. However, you might be able to use a negative lookbehind:
<div>(.(?<!<div>))*text(.(?<!<div>))*<\/div>
This will misbehave if you need to handle nested divs, and probably in other edge cases; use at your own risk.
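To see the difference between the greedy pattern from the question and the [^<]* version, here is a small sketch in Python. It assumes the greedy pattern was run with "dot matches newline" enabled (re.S here), which is what makes it swallow both divs; the variable names are only for illustration.

```python
import re

html = "<div>another words</div>\n<div>\nsome text here\n</div>\n"

# Greedy dot with DOTALL: starts at the first <div> and runs to the last </div>.
greedy = re.search(r"<div>.*text.*</div>", html, re.S)

# [^<]* cannot cross a tag boundary, so the match stays inside one div.
tight = re.search(r"<div>[^<]*text[^<]*</div>", html)

print(repr(greedy.group(0)))
print(repr(tight.group(0)))
```

The second pattern stops working as soon as the div contains any child tag, which is the limitation discussed in the edit above.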
$html = <<< EOF
<div>another words</div>
<div>
some text here
</div>
EOF;
preg_match('%<div>\s+(.*?text.*?)\s+</div>%s', $html, $result);
$result = $result[1];
echo $result;
//some text here
http://ideone.com/qwFlJ8

regex for address in span tags

I need to extract an address which will change on every new page from a sample like this. So I need a regex to extract 100 E Faith Ter from the following html code snippet.
<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>
I am using Javascript.
You don't specify a language, and regular expressions are pretty language agnostic, but they differ in specifying how they deal with multiple lines. In javascript: /^.*$/m selects the first line.
Having updated your question to be full HTML instead of raw text, you can use:
^\<.+?\>(.+?)\<br\>$
and retrieve the first parenthesized submatch (be sure you use the multiline option)
The Pony He Comes!!
A regex is not necessary for the whole thing. Instead, just strip all HTML tags: if you're using PHP, strip_tags does this nicely; otherwise you can use a regex to replace <[^>]+> with an empty string. You should get the plain text of the address, which you can then split into its separate lines.
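The strip-then-split idea can be sketched as follows. This is in Python rather than JavaScript or PHP, purely as an illustration; the snippet is the one from the question and the variable names are made up.

```python
import re

snippet = '''<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>'''

# Replace every tag with an empty string, leaving only the text.
text = re.sub(r"<[^>]+>", "", snippet)

# Split on line breaks and drop lines that are empty after stripping.
lines = [line.strip() for line in text.splitlines() if line.strip()]
street = lines[0]

print(street)
print(lines)
```

This relies on the street address being on the first line of the block, as in the sample; if the markup puts everything on one line, the split would need a different separator.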

I need a regular expression that can match ending tags [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
I need a regular expression that can match ending tags such as </something> and any and ALL data after it. Please help!
Example:
$html = '
<div id="footer">
<div class="wrap">
<strong class="logo">College</strong>
<ul><li>Emergencies</li>
<li>Contact</li>
<li>Copyright</li>
<li>Terms of Use</li>
<li>Member of The Colleges</li>
</ul><p>© 2010 College</p>
</div>
</div>
</body></html>
li>
<li>Contact</li>
<li>Copyright</li>
<li>Terms of Use</li>
<li>Member of The Colleges</li>
</ul><p>© 2010 College</p>
</div>
</div>
</body></html>';
$html = preg_replace("#</html>.*#i", '', $html);
print ($html);
You're trying to parse HTML with regular expressions. Regular expressions are inadequate for parsing HTML safely. What you need is an HTML parser. Take a look at PHP's DOM module.
Tags can be hidden inside comments, CDATA, script and other places, and/or the markup could simply be invalid. If you say it's not markup of any kind, you could do something like this:
/<\/something\s*>((?:(?!<\/something\s*>)[\S\s])+)/ then peel off capture group 1 in a global loop. Don't need to capture the tag unless its a (?:something|something_else|...)
EDIT
Your example doesn't work because you are not using the /s modifier. It works in Perl as $html =~ s/<\/html>.*//s;. This $html =~ s/<\/html>[\S\s]*//; works without the /s modifier.
Change yours to #</html>[\S\s]*#i or use the /s modifier. Dot . will match any character except newline. With the /s modifier it will match newline too.
I just tried it; use $html = preg_replace("#</html>.*#is", '', $html);
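The effect of the /s modifier is easy to reproduce. The sketch below uses Python, where the equivalent flag is re.S (DOTALL), on a shortened stand-in for the HTML in the question; the variable names are illustrative.

```python
import re

html = "<p>kept</p>\n</body></html>\nli>\n<li>Contact</li>\n"

# Without DOTALL, "." stops at the newline, so only "</html>" itself is removed.
without_s = re.sub(r"</html>.*", "", html, flags=re.I)

# With DOTALL, "." also matches newlines and everything after the tag goes.
with_s = re.sub(r"</html>.*", "", html, flags=re.I | re.S)

# [\S\s] matches any character, newline included, without needing the flag.
char_class = re.sub(r"</html>[\S\s]*", "", html, flags=re.I)

print(repr(without_s))
print(repr(with_s))
print(repr(char_class))
```

The last two calls give the same result, which is why either fix works.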
#"</[\da-zA-Z]+>.*"
or for a specific tag
#"</myTag>.*"
Make sure to set the regex options to ignore case. Also check whether something that parses XML would be more helpful.
I don't think this will change your mind, but regexes probably aren't the best way to pull ending tags out of HTML anyway. Jeff Atwood wrote a great essay about why this is not the best approach for solving this particular issue.
Parsing Html The Cthulhu Way

A regular expression question

I have content something like
<div class="c2">
<div class="c3">
<p>...</p>
</div>
</div>
What I want is to match the div.c2's inner HTML. The contents of it may vary a lot. The only problem I am facing is how to make sure the right closing div is matched.
You can't. This problem is unsolvable with classic regular expressions, and with most of the existing regex implementations.
However, some regex engines have special support for balanced pair matching. See, e.g., here (.NET). Though even in this case your regex will be able to parse only a subset of syntactically correct texts (e.g., what if a < /div > is embedded in a comment?). You need an HTML parser to get reliable results.
Any chance this will always be valid XHTML? If so, you'd be better off parsing it as XML than trying to regex this.
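To make the parser suggestion concrete, here is a rough sketch using Python's standard html.parser module. It counts nested div tags to find the matching close. The class InnerHTML and the helper inner_html are hypothetical names, and the approach has known gaps: comments inside the target are dropped and character references come back decoded, so treat it as an illustration rather than a finished tool.

```python
from html.parser import HTMLParser


class InnerHTML(HTMLParser):
    """Collect the inner HTML of the first <div> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting level of <div> tags inside the target
        self.parts = []     # pieces of inner HTML, in document order
        self.done = False   # set once the matching </div> has been seen

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Inside the target: keep the tag as written in the source.
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the div depth.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True   # this was the target's own closing tag
                return
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


def inner_html(html, css_class):
    parser = InnerHTML(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<div class="c2">\n<div class="c3">\n<p>...</p>\n</div>\n</div>'
inner = inner_html(sample, "c2")
print(inner)
```

Because the parser tokenizes the markup first, a closing div that appears inside a comment or attribute value is not counted, which is the case a plain regex cannot handle.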
Delete the first line, delete the last line. Problem solved. No need for RegEx.
The following pattern works well with .Net RegEx implementation:
\<div class="c2"\>{[\n a-z.<>="0-9/]+}\</div\>
And we replace that with \1.
Input:
<div class="c2">
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
</div>
Output:
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>