Perl regexp to find an element inside an element - regex

I need a regular expression that matches from <div id="class1"> to its closing </div>. The element may contain any number of nested <div> elements. Here is a sample input:
This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example
I tried the pattern below, but it only matches up to the first </div>, the one that closes <div id="subclass1">.
Could anyone help me solve this?
The pattern I tried is:
<div id="class1">(?:(?!<\/div>).)*?</div>
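To see why the attempt stops early, here is a minimal Python sketch that runs the same pattern against the sample input. Python's re module stands in for Perl here, which is an assumption of this example; the pattern itself is unchanged apart from dropping the unneeded backslash.

```python
import re

html = ('This is example <div id="class1">This is <div id="subclass1">This is </div> '
        '<div id="subclass2">This is </div> This is </div> This is example')

# Same pattern as in the question: the negative lookahead forbids crossing
# any </div>, so the match has to end at the first closing tag it meets.
pattern = re.compile(r'<div id="class1">(?:(?!</div>).)*?</div>')

matched = pattern.search(html).group(0)
print(matched)
```

The lookahead treats every </div> the same, so the pattern has no way to tell the inner closing tag from the outer one.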

Use a proper HTML parser.
use strict;
use warnings;
use feature qw( say );
use XML::LibXML qw( );
my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_string($html);
my $root = $doc->documentElement();
for my $div ($root->findnodes('//div[@id="class1"]')) {
say "[", $div->toString(), "]";
}
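For comparison, the same parser-based idea can be sketched in Python with only the standard library. The class name DivExtractor and its depth counter are made up for this illustration and are not part of any library. It rebuilds the markup from parser events rather than slicing the source, so entities and self-closing tags may not round-trip exactly.

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collects the markup of the first <div> with a given id, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # number of open <div>s inside the target; 0 means outside
        self.parts = []     # pieces of the captured markup, in document order
        self.done = False   # set once the target's closing tag has been seen

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Still outside: only the target div starts the capture.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        self.parts.append(self.get_starttag_text())
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

html = ('This is example <div id="class1">This is <div id="subclass1">This is </div> '
        '<div id="subclass2">This is </div> This is </div> This is example')
extractor = DivExtractor("class1")
extractor.feed(html)
extractor.close()
outer = "".join(extractor.parts)
print(outer)
```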

$ echo 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example' | sed -n 's/<div id="class1">\(.*\)<\/div>/\1/p'
This is example This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is This is example

You should use an appropriate HTML/XML parser. If you need to do it with a regex for some reason, a recursive pattern helps. (See perldoc perlre for details.)
$re = qr{
(
<div[^>]*>
(?:(??{$re}) | [^<>]*)*
</div>
)
}x;
print "$1\n" if(/$re/o);
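Python's built-in re module has no equivalent of Perl's (??{...}) recursion, so a rough Python counterpart has to count nesting depth by hand. The sketch below is a swapped-in technique, not a translation of the Perl above: it tokenizes the opening and closing div tags with a regex and stops when the depth returns to zero. The helper name outer_div is hypothetical.

```python
import re

# Matches either an opening <div ...> tag or a closing </div> tag.
TAG = re.compile(r'<div\b[^>]*>|</div>')

def outer_div(html, opening):
    """Return the substring from `opening` to its matching </div>, or None."""
    start = html.find(opening)
    if start == -1:
        return None
    depth = 0
    for m in TAG.finditer(html, start):
        # Opening tags raise the depth, closing tags lower it.
        depth += 1 if m.group(0) != '</div>' else -1
        if depth == 0:
            return html[start:m.end()]
    return None  # unbalanced input: the opening tag is never closed

html = ('This is example <div id="class1">This is <div id="subclass1">This is </div> '
        '<div id="subclass2">This is </div> This is </div> This is example')
result = outer_div(html, '<div id="class1">')
print(result)
```

Like the recursive regex, this assumes well-formed markup and would be confused by a "<div" inside a comment or attribute value.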

Many people say "use a proper HTML parser" rather than a regex to parse HTML. What some fail to realize is that there are requirements to be met, and those requirements might call for a regex.
<div id=".+?">.*</div> works for the example input. Note that the greedy .* runs to the last </div> in the input, so it would also swallow any sibling <div> that follows.
http://regexr.com?33336

Related

Why doesn't this regexp work for this HTML?

<div class="_1zGQT _2ugFP message-in">
<div class="-N6Gq">
<div class="copyable-text" data-pre-plain-text="[18:09, 3.6.2019] Лера сестра: ">
<div class="_12pGw">
<div class="_3X58t selectable-text invisible-space copyable-text">
<span class="_2ZDCk">
<img crossorigin="anonymous" src="URL" alt="😆" draggable="false" class="_298rb _2FANH selectable-text invisible-space copyable-text" data-plain-text="😆" style="visibility: visible;">
</span>
</div>
</div>
</div>
</div>
</div>
I tried to get the div with this code:
soup.find('div', class_=re.compile('^selectable-text invisible-space copyable-text'))
All I got was None.
The problem is that part of the class (_3X58t) changes.
This is likely due to the ^ anchor, which requires the match to start at the beginning of the class string. Removing it gives:
soup.find('div', class_=re.compile('selectable-text invisible-space copyable-text'))
or we might try this expression for the divs:
(.+?selectable-text invisible-space copyable-text)
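The effect of the anchor can be checked with plain re, without BeautifulSoup. This sketch assumes the pattern ends up being searched against the full class attribute value, which starts with the changing _3X58t token.

```python
import re

# Full class attribute of the target div, as in the question's HTML.
class_attr = "_3X58t selectable-text invisible-space copyable-text"

anchored = re.compile(r'^selectable-text invisible-space copyable-text')
unanchored = re.compile(r'selectable-text invisible-space copyable-text')

# ^ pins the match to position 0, where the string actually has "_3X58t".
anchored_hit = anchored.search(class_attr) is not None
unanchored_hit = unanchored.search(class_attr) is not None

print(anchored_hit, unanchored_hit)
```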
I would first check whether a single class from the compound class list is enough, e.g.
soup.select_one('.selectable-text')
Otherwise, combine the classes:
soup.select_one('[class$="selectable-text invisible-space copyable-text"]')
rather than resorting to regex.

Grab contents of div with regex in PowerShell

I have a directory of similar structured HTML files (two examples given):
File-1.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>bar</p></div></div>
<div class="baz">baz</div>
</body>
</html>
File-2.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>apple<br>banana</p></div></div>
<div class="baz">baz</div>
</body>
</html>
I am trying to write a PowerShell script that returns the contents of the bar div, stripped of all HTML:
For File-1.html: bar
For File-2.html: apple banana
I now have:
$directory = "C:\Users\Public\Documents\Sandbox\HTML"
foreach ($file in Get-ChildItem($directory))
{
$content = Get-Content $file.fullname
$test = [regex]::matches($content, '(?i)<div class="bar">(.*)</div>')
echo $test[0]
}
However, this returns <div class="bar"><div><p>bar</p></div></div><div class="baz">baz</div>. In other words, the regex does not stop until the last </div>. How can I make it grab only what is in the <div class="bar"> div?
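By default, quantifiers are greedy: they match as much as possible while still allowing the remainder of the regular expression to match. Use *? for a non-greedy match, meaning "zero or more, preferably as few as possible". Here the lazy match ends at the first </div>, so the group holds <div><p>bar</p>, which is enough once the tags are stripped.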
(?si)<div class="bar">(.*?)</div>
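The difference between the greedy and the lazy quantifier can be sketched in Python on the contents of File-2.html, joined onto one line. The tag-stripping step at the end is an assumption about how the final "apple banana" text might be produced; it is not part of the answer above.

```python
import re

content = ('<html> <body> <div class="foo">foo</div> '
           '<div class="bar"><div><p>apple<br>banana</p></div></div> '
           '<div class="baz">baz</div> </body> </html>')

# Greedy: .* runs on to the last </div> in the whole input.
greedy = re.search(r'(?si)<div class="bar">(.*)</div>', content).group(1)

# Lazy: .*? stops at the first </div> after the opening tag.
lazy = re.search(r'(?si)<div class="bar">(.*?)</div>', content).group(1)

# Replace each remaining tag with a space, then collapse whitespace.
text = ' '.join(re.sub(r'<[^>]+>', ' ', lazy).split())

print(lazy)
print(text)
```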

phpQuery returning wrong results with regex

$html = '<html>
<body>
<div id="dupe_1">1
<div class="dupe_1.1">1.1</div>
<div id="dupe_1.2">1.2</div>
</div>
<div id="dupe_2">2
<div class="dupe_2.1">2.1</div>
<div id="dupe_2.2">2.2</div>
<div>extra</div>
</div>
</body>
</html>';
$html = phpQuery::newDocumentHTML($html);
$node = pq('div:regex(id,^dupe_\d+$)',$html);
echo count($node);
This returns 7, that is, all the divs. It should return only 2 divs (dupe_1 and dupe_2).
I would avoid doing that; you should be able to get those with a CSS selector:
[id*=dupe_]:not([id*="."])
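As a sanity check on what the result should be, the sketch below collects the div ids with Python's standard HTML parser and applies both the question's regex and the selector's logic (id contains "dupe_" and no "."). The IdCollector class is made up for this example, and it says nothing about why phpQuery returns 7.

```python
import re
from html.parser import HTMLParser

class IdCollector(HTMLParser):
    """Records the id attribute of every <div> that has one."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            div_id = dict(attrs).get("id")
            if div_id is not None:
                self.ids.append(div_id)

html = '''<html>
<body>
<div id="dupe_1">1
<div class="dupe_1.1">1.1</div>
<div id="dupe_1.2">1.2</div>
</div>
<div id="dupe_2">2
<div class="dupe_2.1">2.1</div>
<div id="dupe_2.2">2.2</div>
<div>extra</div>
</div>
</body>
</html>'''

collector = IdCollector()
collector.feed(html)

# The question's regex, applied to whole id values.
by_regex = [i for i in collector.ids if re.fullmatch(r'dupe_\d+', i)]

# The selector's logic: id contains "dupe_" and does not contain ".".
by_selector = [i for i in collector.ids if 'dupe_' in i and '.' not in i]

print(by_regex, by_selector)
```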

Regular expression negates the expression

I'm using the PCRE regex engine, and I have a string that looks like this:
<h3 class="description">Description</h3> <div class="wrapper"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div> </div>
and a regexp that works fine and captures the string "dddsome string blah blahddssssseeeee".
It looks like this:
<\s*h3\s*class="*.+?"\s*>.*?</\s*h3>.+?<\s*div.+?class\s*="wrapper"\s*>(.+?)<\s*div\s*class="empty">
Sometimes I get almost the same string, like the one below (note the <div class="aplus"> tag). When this tag appears, I want the regexp above to fail to match the whole string.
<h3 class="description">Description</h3> <div class="wrapper"> <div class="aplus"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div>
Try this:
<div.*>(.*)<div.*>
but consider Beautiful Soup for easier web scraping.
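One way the "fail when aplus is present" requirement might be expressed is a negative lookahead placed right after the wrapper's opening tag. The sketch below is an assumption, not something from the answers: it uses a tightened version of the question's pattern and runs it in Python's re, whose lookahead syntax is the same as PCRE's for this case.

```python
import re

# (?!...) rejects the match if an aplus div directly follows the wrapper tag.
pattern = re.compile(
    r'<h3\s+class="[^"]*"\s*>.*?</h3>\s*'
    r'<div\s+class="wrapper"\s*>'
    r'(?!\s*<div\s+class="aplus")'
    r'(.+?)<div\s+class="empty">'
)

plain = ('<h3 class="description">Description</h3> <div class="wrapper"> '
         'dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div> </div>')
with_aplus = ('<h3 class="description">Description</h3> <div class="wrapper"> '
              '<div class="aplus"> dddsome string blah blahddssssseeeee '
              '<div class="empty"> </div></div> </div>')

plain_match = pattern.search(plain)
aplus_match = pattern.search(with_aplus)

captured = plain_match.group(1).strip() if plain_match else None
print(captured, aplus_match)
```

The lookahead only checks the position immediately after the wrapper tag, so an aplus div appearing later inside the wrapper would not block the match.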

What is wrong with my regular expression?

How can I go about getting the values, e.g.
<div class="detail"> Hello </div>
<div class="detail"> World </div>
string x = " <div class="results-list clearfix">
<div class="detail"> Hello
</div>
</div>
<div class="results-list clearfix">
<div class="detail"> World
</div>
</div>
";
String pattern = @"<div class=""results-list clearfix"">(?<Content>[^<]*)</div>";
Regex rx = new Regex(pattern,RegexOptions.Multiline);
Match m = rx.Match(x);
while (m.Success)
{
string zz = m.Groups["Content"].Value;
m = m.NextMatch();
}
I think this is your problem ""results-list clearfix"". As you are using a literal string, you can remove the extra "'s.
It is a bad idea to use regular expressions for this kind of parsing. Use an XML parser for this scenario. I suggest LINQ to XML, i.e. XElement.Parse(...).
Do not forget to wrap your HTML in a single root element, though.
Try this pattern with the RegexOptions.Singleline option:
string pattern = "<div\\sclass=\"results-list clearfix\">\\s*(?<Content><div[^>]*>.*?</div>)";
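The same pattern can be tried out in Python, where re.DOTALL plays the role of the Singleline option and the named group is written (?P<Content>...). This is only a sketch for checking the regex against the question's input; the whitespace normalization at the end is an added assumption to get the one-line form the question asks for.

```python
import re

x = '''
<div class="results-list clearfix">
<div class="detail"> Hello
</div>
</div>
<div class="results-list clearfix">
<div class="detail"> World
</div>
</div>
'''

# re.DOTALL lets "." match newlines, like RegexOptions.Singleline in .NET.
pattern = re.compile(
    r'<div\sclass="results-list clearfix">\s*(?P<Content><div[^>]*>.*?</div>)',
    re.DOTALL,
)

# Collapse the line breaks inside each captured inner div.
details = [' '.join(m.group('Content').split()) for m in pattern.finditer(x)]
print(details)
```

The question's original [^<]* cannot match here because the content of the outer div begins with another tag.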