Regex select text and nearest wrapper - regex

I have this text
<div>another words</div>
<div>
some text here
</div>
I want to get the <div> element that contains the word 'text'. The expected result is:
<div>
some text here
</div>
I can do it like this:
<div>.*text.*<\/div>
but it selects all text.

Try
<div>[^<]*text[^<]*<\/div>
To not include tags in the inner part of the match.
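As a rough illustration of the difference, here is a minimal Python sketch (Python's `re` module is used only for demonstration; the thread itself is not tied to a language). The `re.DOTALL` flag on the greedy pattern is an assumption made so that `.` can cross the newlines in the sample.

```python
import re

html = """<div>another words</div>
<div>
some text here
</div>"""

# Greedy dot (with DOTALL so it can cross newlines): the match runs
# from the first <div> all the way to the last </div>.
greedy = re.search(r"<div>.*text.*</div>", html, re.DOTALL).group(0)

# Negated class: the inner part cannot contain "<", so the match
# cannot cross a tag boundary and stays inside a single div.
tight = re.search(r"<div>[^<]*text[^<]*</div>", html).group(0)

print(tight)
```

The negated class works here only because the wanted div contains no child tags; any nested markup would make it fail to match.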
Also, regexp is not an ideal tool for parsing HTML. Consider whether your use case is better served by "proper" HTML parsing tools.
Edit:
If you have nested tags you are definitely leaving the area where regexp is a suitable tool. However, you might be able to use a negative lookbehind:
<div>(.(?<!<div>))*text(.(?<!<div>))*<\/div>
This will misbehave if you need to handle nested divs, and probably in other edge cases. Use at your own risk.

$html = <<< EOF
<div>another words</div>
<div>
some text here
</div>
EOF;
preg_match('%<div>\s+(.*?text.*?)\s+</div>%s', $html, $result);
$result = $result[1];
echo $result;
//some text here
http://ideone.com/qwFlJ8

Related

Regex to match HTML with or without link

I would like to be able to get "Target" out of this block of HTML when it appears in a page:
<h3>
<a href="http://link"> Target
</a> </h3>
I can count on the spacing being reliably there. What I can't count on is that "Target" will always be included in an anchor tag. Sometimes, it looks like this:
<h3>
Target
</h3>
I can match the first version and extract "Target" pretty easily with this regex:
/<h3>\s+<a href=.*>\s+(.*)\s+<\/a>\s+<\/h3>/
But I'm struggling to write one that will match both. Any ideas?
Don't use regular expressions to parse HTML. It is more painful than it is worth in most cases. Use a library designed to parse HTML.
#!/usr/bin/perl
use v5.16;
use strict;
use warnings;
use HTML::TreeBuilder;
my $data = qq{<body><h3>
<a href="http://link"> Target
</a> </h3></body>
};
my $otherdata = qq{<body><h3>
Target
</h3></body>
};
my $t = HTML::TreeBuilder->new_from_content($data);
say $t->look_down(_tag => "h3")->as_text();
$t = HTML::TreeBuilder->new_from_content($otherdata);
say $t->look_down(_tag => "h3")->as_text();
Just to put my two cents in, why not use an xpath query with a decent Dom library?
//html/body/h3/text()[contains(.,'Target')]
The actual query may vary depending upon your html structure.
Try this one as a regex:
<h3>\s+(<a href=.*>)?\s+(.*)\s+(<\/a>)?\s+<\/h3>
It should match both your cases.
Even though this is not a recommended way to search html, if this is what you want to try, I won't stop you.
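If a regex is used anyway, one possible variant is sketched below in Python. It is not the pattern from the answer above: it makes the anchor tags optional non-capturing groups and relaxes `\s+` to `\s*`, on the assumption that the text may be separated from `<h3>` by a single newline (two consecutive `\s+` would need at least two whitespace characters). The name `H3_TEXT` and the two sample strings are illustrative.

```python
import re

# Optional, non-capturing <a ...> and </a>; \s* instead of \s+ so the
# pattern still matches when only one newline follows <h3>.
H3_TEXT = re.compile(
    r"<h3>\s*(?:<a\b[^>]*>)?\s*(.*?)\s*(?:</a>)?\s*</h3>",
    re.DOTALL,
)

with_link = '<h3>\n<a href="http://link"> Target\n</a> </h3>'
without_link = "<h3>\nTarget\n</h3>"

for sample in (with_link, without_link):
    print(H3_TEXT.search(sample).group(1))
```

The lazy `(.*?)` keeps the capture as short as possible, so the surrounding whitespace and the closing tag stay outside group 1.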

Reg Exp: Get string only if it is not between a tags

I am doing a search and replace of some terms, adding a link to these words. If these words are already part of another link, I should skip the replacement (otherwise I would end up with <a href...> <a href ...> word </a> </a>, which is something I want to avoid).
I don't know if this is possible, so I'd like to know that and, if it is, get any hint. I am kind of lost. So far, I am only able to get the words that are part of a link, but not the ones that are not.
Thanks!
You can do something like this:
$urls = array('word1'=> 'http://urlfor.word1.com',
'word2'=> 'http://urlfor.word2.com',
'word3'=> 'http://urlfor.word3.com');
$pattern = '~<(?:a\s.*?</a>|!--.*?(?:-->|$)|[^>]+>)(*SKIP)(*FAIL)|\b(?:word1|word2|word3)\b~sD';
$result = preg_replace_callback($pattern, function($m) use ($urls) {
    return '<a href="' . $urls[$m[0]] . '">' . $m[0] . '</a>'; },
    $html);
$urls is an associative array where keys are the words and the values are corresponding urls.
The pattern uses the (*SKIP)(*FAIL) trick to skip parts that are already between link tags, inside a tag, or in an HTML comment. (Note that you can easily extend the pattern to skip script, style and CDATA content, or to deal with unclosed <a> tags.)
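Engines without backtracking control verbs (Python's built-in `re` module, for example) can get a similar effect with a different technique: match the parts to skip in the first alternatives, capture the wanted word in the last one, and let the callback return skipped parts unchanged. The sketch below is an illustration under that assumption, not a port of the PHP answer; the `urls` map, `link_word` and the sample `html` are hypothetical.

```python
import re

# Hypothetical word-to-URL map, mirroring $urls in the answer above.
urls = {
    "word1": "http://urlfor.word1.com",
    "word2": "http://urlfor.word2.com",
}

# Alternatives 1-3 consume an existing link, a comment, or a single tag.
# The last alternative holds the only capture group: a bare word.
pattern = re.compile(
    r"<a\s.*?</a>|<!--.*?(?:-->|$)|<[^>]+>|\b("
    + "|".join(map(re.escape, urls))
    + r")\b",
    re.DOTALL,
)

def link_word(m):
    word = m.group(1)
    if word is None:
        # Markup or an existing link matched: leave it untouched.
        return m.group(0)
    return '<a href="{}">{}</a>'.format(urls[word], word)

html = '<p>word1 and <a href="/x">word2</a> <!-- word1 --></p>'
result = pattern.sub(link_word, html)
print(result)
```

Only the first `word1` gets wrapped; the `word2` inside the existing link and the `word1` inside the comment are consumed by the skip alternatives before the word alternative can see them.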
This worked:
~<(?:a\s.*?</a>|[^>]+>)(*SKIP)(*FAIL)|\b(?:ultrices)\b~ig
adding g to get all the matches and not only the first one.

Perl regex: search of all class="" in string and save values in array

I am trying to get classes from a string in HTML document.
String for example:
<span class="bullet first">Some</span>Published <abbr class="published">Sometexthere</abbr></p>
So, what I am trying to achieve is to get all the classes in the string (bullet, first, published).
But the problem is that there can be any number of class="" attributes in the string.
So, I guess there is no way to do that with one regex, and I need a loop here?
No matter how you do it, it's a two step process:
Extract the values of the class attributes ("bullet first", "published").
Extract the classes from those values ("bullet", "first", "published").
XML::LibXML (which is also an HTML parser):
my @classes =
    map split(' ', $_->getValue()), # Step 2
    $xpc->findnodes('*/@class', $node); # Step 1
(Or maybe .//*/@class, depending on what you want.)
I am adding this to answer the part 'So, I guess there is no way to do that with one regex, I need cycle here?'
You have to use the modifier g in the regexp
my $text = '<span class="bullet first">Some</span>Published <abbr class="published">Sometexthere</abbr></p>';
while($text =~ /class\s*=\s*"([^"]+)"/g) {
print "class --> $1\n";
}
This is the result
class --> bullet first
class --> published
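For comparison, the same two steps (collect each attribute value, then split it into class names) can be sketched in Python. This is only an illustration of the idea using the sample string from the question; the variable names are arbitrary.

```python
import re

text = ('<span class="bullet first">Some</span>Published '
        '<abbr class="published">Sometexthere</abbr></p>')

# Step 1: every class attribute value (findall plays the role of /g).
values = re.findall(r'class\s*=\s*"([^"]+)"', text)

# Step 2: split each value on whitespace to get the individual classes.
classes = [name for value in values for name in value.split()]

print(classes)
```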
If you are sure the HTML does not contain tricky data such as <p> class="abc" </p>, then looping through a regex with the global modifier works: each iteration resumes at the place where the previous match ended.
Example
while ($_ =~ /class="(.*?)"/g) {
    # process class names here
    # the class attribute value is in $1
}
However, for general use an HTML parser is recommended, because the regex above would treat the string <p> class="abc" </p> as containing the class abc.
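A small Python sketch of that difference, using the standard library's `html.parser`, might look like this. The `ClassCollector` name and the sample `doc` are made up for the example; the point is only that a parser reports `class` when it is an attribute of a tag, while the regex also fires on plain text that merely looks like one.

```python
import re
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Collect class names from real class attributes only."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this start tag.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.extend(value.split())

doc = '<p> class="abc" </p><span class="bullet first">Some</span>'

collector = ClassCollector()
collector.feed(doc)
collector.close()

# The regex cannot tell text content from attributes.
regex_hits = re.findall(r'class="(.*?)"', doc)

print(collector.classes)
print(regex_hits)
```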

REGEX Pattern - How do I match upto a certain tag in html

I have some HTML which I want to grab between two tags. However, nested tags exist in the HTML, so looking for </div> wouldn't work, as it would stop at the first nested div.
Basically I want my regex to..
Match some text literally, followed by ANY character up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
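To make the greedy/lazy distinction concrete, here is a short Python sketch on the same sample string as the PHP one-liner. The three variable names are illustrative only.

```python
import re

text = "<div>Test<div>foo</div></div>"

# Greedy star: grabs as much as possible, up to the LAST </div>.
greedy = re.search(r"<div[^>]*>(.*)</div>", text).group(1)

# Lazy star: grabs as little as possible, up to the FIRST </div>.
lazy = re.search(r"<div[^>]*>(.*?)</div>", text).group(1)

# Lazy star stopping at the start of the first nested div.
before_nested = re.search(r"<div[^>]*>(.*?)<div", text).group(1)

print(repr(greedy), repr(lazy), repr(before_nested))
```

Neither quantifier understands nesting: the lazy version cuts the nested div in half, which is why the answers here keep pointing towards a parser.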
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use a html/xml parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/

I need a regular expression that can match ending tags [duplicate]

I need a regular expression that can match ending tags such as </something> and any and ALL data after it. Please help!
Example:
$html = '
<div id="footer">
<div class="wrap">
<strong class="logo">College</strong>
<ul><li>Emergencies</li>
<li>Contact</li>
<li>Copyright</li>
<li>Terms of Use</li>
<li>Member of The Colleges</li>
</ul><p>© 2010 College</p>
</div>
</div>
</body></html>
li>
<li>Contact</li>
<li>Copyright</li>
<li>Terms of Use</li>
<li>Member of The Colleges</li>
</ul><p>© 2010 College</p>
</div>
</div>
</body></html>';
$html = preg_replace("#</html>.*#i", '', $html);
print ($html);
You're trying to parse HTML with regular expressions. Regular expressions are inadequate for parsing HTML safely. What you need is an HTML parser. Take a look at PHP's DOM module.
Tags can be hidden inside comments, CDATA, script and other places, and/or the markup could just be invalid. If you say it's not markup of any kind, you could do something like this:
/<\/something\s*>((?:(?!<\/something\s*>)[\S\s])+)/ then peel off capture group 1 in a global loop. Don't need to capture the tag unless its a (?:something|something_else|...)
EDIT
Your example doesn't work because you are not using the /s modifier. It works in Perl as $html =~ s/<\/html>.*//s;. This $html =~ s/<\/html>[\S\s]*//; works without the /s modifier.
Change yours to #</html>[\S\s]*#i or use the /s modifier. Dot . will match any character except newline. With /s modifier it will match newline too.
I just tried it; use $html = preg_replace("#</html>.*#is", '', $html);
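The same behaviour can be reproduced in Python, where `re.DOTALL` plays the role of the /s modifier. This is a sketch on a shortened, made-up version of the question's input, meant only to show the effect of the flag.

```python
import re

html = "</div>\n</body></html>\nli>\n<li>Contact</li>\n</html>"

# Without DOTALL, "." stops at each newline, so only the rest of the
# line after each </html> is removed and the junk in between survives.
without_flag = re.sub(r"</html>.*", "", html, flags=re.IGNORECASE)

# With DOTALL, "." also matches newlines: everything from the first
# </html> to the end of the string is removed.
with_flag = re.sub(r"</html>.*", "", html, flags=re.IGNORECASE | re.DOTALL)

# [\s\S] matches any character, newline included, without the flag.
char_class = re.sub(r"</html>[\s\S]*", "", html, flags=re.IGNORECASE)

print(repr(with_flag))
```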
@"</[\da-zA-Z]+>.*"
or for a specific tag
@"</myTag>.*"
Making sure to set the regex options to ignore case. Although make sure something that parses xml isn't more helpful.
I don't think this will change your mind, but regexes probably aren't the best way to pull ending tags out of HTML anyway. Jeff Atwood wrote a great essay about why this is not the best approach for solving this particular issue.
Parsing Html The Cthulhu Way