Perl regexp to find an element inside an element


I need a regular expression that matches from <div id="class1"> to its closing </div>. The element may contain any number of nested <div> elements. Here is a sample:
This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example
I tried the pattern below, but it matches only up to the first </div>, the one that closes <div id="subclass1">.
Could anyone help me solve this?
The pattern I tried is:
<div id="class1">(?:(?!<\/div>).)*?</div>

Use a proper HTML parser.
use strict;
use warnings;
use feature qw( say );
use XML::LibXML qw( );
my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_string($html);
my $root = $doc->documentElement();
for my $div ($root->findnodes('//div[@id="class1"]')) {
say "[", $div->toString(), "]";
}
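For comparison, the same parser-based idea can be sketched in Python with only the standard library. This is a minimal sketch, not a port of the Perl answer: the class name DivGrabber and its attributes are made up for illustration, the markup is reassembled from parser events rather than sliced from the input, and self-closing tags such as <br/> are not handled specially.

```python
from html.parser import HTMLParser


class DivGrabber(HTMLParser):
    """Reassemble the markup of the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # how many <div> levels deep we are inside the target
        self.parts = []     # pieces of the target element, in document order
        self.done = False   # set once the target's closing </div> is seen

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Still looking for the target element.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        self.parts.append(self.get_starttag_text())
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


html = ('This is example <div id="class1">This is <div id="subclass1">This is </div> '
        '<div id="subclass2">This is </div> This is </div> This is example')

grabber = DivGrabber("class1")
grabber.feed(html)
grabber.close()
result = "".join(grabber.parts)
print(result)
```

Because the parser reports every start and end tag, the nesting depth is tracked explicitly and the nested </div> tags do not end the capture early.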

$ echo 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example' | sed -n 's/<div id="class1">\(.*\)<\/div>/\1/p'
This is example This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is This is example

You should use an appropriate HTML/XML parser. If you need to do it with a regex for some reason, a recursive regex helps. (See perldoc perlre for details.)
$re = qr{
(
<div[^>]*>
(?:(??{$re}) | [^<>]*)*
</div>
)
}x;
print "$1\n" if(/$re/o);
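Python's built-in re module has no counterpart to Perl's (??{...}) recursion, so the same balanced matching can be approximated by scanning <div> tags and counting depth. The following is a rough sketch under that assumption; the helper name balanced_div is hypothetical, and it assumes open_pattern matches an opening <div> tag.

```python
import re

# Matches any opening or closing div tag; group 1 is "/" for a closing tag.
TAG = re.compile(r"<(/?)div\b[^>]*>", re.I)


def balanced_div(text, open_pattern):
    """Return the text from the first tag matching open_pattern through its
    matching </div>, or None if there is no match or the divs are unbalanced."""
    start = re.search(open_pattern, text)
    if start is None:
        return None
    depth = 0
    # Scan div tags beginning at the opening tag itself.
    for m in TAG.finditer(text, start.start()):
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return text[start.start():m.end()]
    return None


html = ('This is example <div id="class1">This is <div id="subclass1">This is </div> '
        '<div id="subclass2">This is </div> This is </div> This is example')

matched = balanced_div(html, r'<div id="class1">')
print(matched)
```

Like the recursive Perl pattern, this only balances <div> tags and knows nothing about comments, scripts, or malformed markup.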

A lot of people always say "Use a proper HTML parser" to parse HTML and not regex. What some people fail to realize is that there are requirements to be met and those requirements might require regex.
<div id=".+?">.*</div> should work for you.
http://regexr.com?33336
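A quick way to see what the greedy pattern above does is to run it on the sample and on a variant with another div after the target. This sketch uses Python's re module purely for illustration; the extra "other" div is an invented test case, not part of the question.

```python
import re

pattern = re.compile(r'<div id=".+?">.*</div>')

sample = ('This is example <div id="class1">This is <div id="subclass1">This is </div> '
          '<div id="subclass2">This is </div> This is </div> This is example')
# Hypothetical variant: one more, unrelated div after the target element.
trailing = sample + ' <div id="other">unrelated</div>'

first = pattern.search(sample).group(0)
second = pattern.search(trailing).group(0)

print(first)
print(second)
```

On the sample the match ends at the outer </div>, as wanted. On the variant the greedy .* runs on to the last </div> in the input, so the pattern fits only when the target element is the last div in the text.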

Related

Why doesn't this regexp work for this HTML?

<div class="_1zGQT _2ugFP message-in">
<div class="-N6Gq">
<div class="copyable-text" data-pre-plain-text="[18:09, 3.6.2019] Лера сестра: ">
<div class="_12pGw">
<div class="_3X58t selectable-text invisible-space copyable-text">
<span class="_2ZDCk">
<img crossorigin="anonymous" src="URL" alt="😆" draggable="false" class="_298rb _2FANH selectable-text invisible-space copyable-text" data-plain-text="😆" style="visibility: visible;">
</span>
</div>
</div>
</div>
</div>
</div>
I tried to get it with this code:
soup.find('div', class_=re.compile('^selectable-text invisible-space copyable-text'))
All I got was None.
The problem is that part of the class (_3X58t) changes.

This is likely due to the ^ anchor, since the class attribute starts with the changing part (_3X58t). Removing the anchor gives:
soup.find('div', class_=re.compile('selectable-text invisible-space copyable-text'))
Or we might try this expression for the divs:
(.+?selectable-text invisible-space copyable-text)
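The effect of the anchor can be checked with plain re, without BeautifulSoup. This sketch assumes the pattern ends up being searched against the full class string of the div, which is how the behaviour in the question is explained here.

```python
import re

# The class attribute of the target div, copied from the question.
class_attr = "_3X58t selectable-text invisible-space copyable-text"

anchored = re.compile(r"^selectable-text invisible-space copyable-text")
unanchored = re.compile(r"selectable-text invisible-space copyable-text")

anchored_hit = anchored.search(class_attr)
unanchored_hit = unanchored.search(class_attr)

print(anchored_hit)                # None: the string starts with "_3X58t"
print(unanchored_hit is not None)  # True
```

The anchored pattern can only match at position 0, where the changing prefix sits, so it never succeeds on this attribute.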

I would first check whether a single class from the compound class list can be used, e.g.
soup.select_one('.selectable-text')
Otherwise, combine the classes:
soup.select_one('[class$="selectable-text invisible-space copyable-text"]')
rather than resorting to regex.

Grab contents of div with regex in PowerShell

I have a directory of similarly structured HTML files (two examples given):
File-1.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>bar</p></div></div>
<div class="baz">baz</div>
</body>
</html>
File-2.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>apple<br>banana</p></div></div>
<div class="baz">baz</div>
</body>
</html>
I am trying to create a PowerShell script that returns the contents of the bar div, stripped of all HTML:
For File-1.html: bar
For File-2.html: apple banana
I now have:
$directory = "C:\Users\Public\Documents\Sandbox\HTML"
foreach ($file in Get-ChildItem($directory))
{
$content = Get-Content $file.fullname
$test = [regex]::matches($content, '(?i)<div class="bar">(.*)</div>')
echo $test[0]
}
However, this returns <div class="bar"><div><p>bar</p></div></div><div class="baz">baz</div>. In other words, the regex does not stop until the last </div>. How can I make it grab only what is in the <div class="bar"> div?

By default, quantifiers are greedy: they match as much as possible while still allowing the remainder of the regular expression to match. Use *? for a non-greedy match, meaning "zero or more, preferably as few as possible".
(?si)<div class="bar">(.*?)</div>
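The difference between the two quantifiers can be seen on the body of File-2.html. The sketch below uses Python's re module rather than PowerShell, and the tag-stripping step at the end is an added assumption about how the asker would get from the captured markup to plain text.

```python
import re

# Body of File-2.html from the question.
html = (
    '<div class="foo">foo</div>\n'
    '<div class="bar"><div><p>apple<br>banana</p></div></div>\n'
    '<div class="baz">baz</div>\n'
)

greedy = re.search(r'(?si)<div class="bar">(.*)</div>', html).group(1)
lazy = re.search(r'(?si)<div class="bar">(.*?)</div>', html).group(1)

# Replace every tag with a space, then collapse runs of whitespace.
text = " ".join(re.sub(r"<[^>]+>", " ", lazy).split())

print(repr(greedy))
print(repr(lazy))
print(text)
```

Note that the lazy capture stops at the first </div>, which here is the one closing the inner div, so it holds <div><p>apple<br>banana</p>. That is enough for this task because the tags are removed afterwards, but it is not the full content of the bar div.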

phpQuery returning wrong results with regex

$html = '<html>
<body>
<div id="dupe_1">1
<div class="dupe_1.1">1.1</div>
<div id="dupe_1.2">1.2</div>
</div>
<div id="dupe_2">2
<div class="dupe_2.1">2.1</div>
<div id="dupe_2.2">2.2</div>
<div>extra</div>
</div>
</body>
</html>';
$html = phpQuery::newDocumentHTML($html);
$node = pq('div:regex(id,^dupe_\d+$)',$html);
echo count($node);
This returns 7, i.e. all the divs. It should return only 2 divs (dupe_1 and dupe_2).

I would avoid doing things like that; you should be able to select those with CSS:
[id*=dupe_]:not([id*="."])
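To see how the selector relates to the regex the asker intended, both conditions can be applied to the id values from the sample document. This is only an illustration in Python of the two tests, not phpQuery code; the list of ids is copied by hand from the question.

```python
import re

# id attributes present in the sample document (the *.1 divs use class, not id).
ids = ["dupe_1", "dupe_1.2", "dupe_2", "dupe_2.2"]

# What the asker's regex ^dupe_\d+$ would accept.
by_regex = [i for i in ids if re.search(r"^dupe_\d+$", i)]

# What [id*=dupe_]:not([id*="."]) expresses: contains "dupe_" and contains no ".".
by_selector_logic = [i for i in ids if "dupe_" in i and "." not in i]

print(by_regex)
print(by_selector_logic)
```

Both give the same two ids on this sample. The selector is looser in general, since it would also accept an id such as "xdupe_foo", which the regex rejects.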

Regular expression negates the expression

I'm using the PCRE regexp engine, and I have a string that looks like this:
<h3 class="description">Description</h3> <div class="wrapper"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div> </div>
and a regexp that works fine and captures the string "dddsome string blah blahddssssseeeee". It looks like this:
<\s*h3\s*class="*.+?"\s*>.*?</\s*h3>.+?<\s*div.+?class\s*="wrapper"\s*>(.+?)<\s*div\s*class="empty">
Sometimes I get almost the same string, but with a <div class="aplus"> tag, as shown below. When this tag appears, I want the regexp above to fail to match the whole string.
<h3 class="description">Description</h3> <div class="wrapper"> <div class="aplus"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div>

Try this:
<div.*>(.*)<div.*>
but consider using Beautiful Soup for easier web scraping.
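For the specific requirement in the question, failing when <div class="aplus"> directly follows the wrapper, a negative lookahead is one option. The sketch below is written for Python's re module, and the pattern is a simplified version of the asker's, not the original; PCRE supports the same (?!...) syntax.

```python
import re

pattern = re.compile(
    r'<h3\s+class="[^"]*"\s*>.*?</h3>\s*'
    r'<div\s+class="wrapper"\s*>'
    r'(?!\s*<div\s+class="aplus")'      # refuse to match if aplus comes next
    r'(.+?)<div\s+class="empty">',
    re.S,
)

plain = ('<h3 class="description">Description</h3> <div class="wrapper"> '
         'dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div> </div>')
with_aplus = ('<h3 class="description">Description</h3> <div class="wrapper"> '
              '<div class="aplus"> dddsome string blah blahddssssseeeee '
              '<div class="empty"> </div></div> </div>')

plain_match = pattern.search(plain)
aplus_match = pattern.search(with_aplus)

print(plain_match.group(1).strip())
print(aplus_match)
```

The lookahead consumes nothing, so the capture group is unchanged for strings without the aplus div, while strings with it produce no match at all.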

What is wrong with my regular expression?

How can I go about getting these values, e.g.
<div class="detail"> Hello </div>
<div class="detail"> World </div>
string x = @" <div class=""results-list clearfix"">
<div class=""detail""> Hello
</div>
</div>
<div class=""results-list clearfix"">
<div class=""detail""> World
</div>
</div>
";
string pattern = @"<div class=""results-list clearfix"">(?<Content>[^<]*)</div>";
Regex rx = new Regex(pattern, RegexOptions.Multiline);
Match m = rx.Match(x);
while (m.Success)
{
string zz = m.Groups["Content"].Value;
m = m.NextMatch();
}

I think this is your problem ""results-list clearfix"". As you are using a literal string, you can remove the extra "'s.

It is a bad idea to use regular expressions for this kind of parsing. Use an XML parser for this scenario. I suggest LINQ to XML, i.e. XElement.Parse(...).
Do not forget to wrap your HTML in a single root element, though.

Try this pattern with the RegexOptions.Singleline option:
string pattern = "<div\\sclass=\"results-list clearfix\">\\s*(?<Content><div[^>]*>.*?</div>)";
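The pattern above can be tried out in Python, where re.S plays the role of the Singleline option and named groups are written (?P<name>...). This is an illustrative sketch on the sample string from the question, not C# code.

```python
import re

x = """ <div class="results-list clearfix">
<div class="detail"> Hello
</div>
</div>
<div class="results-list clearfix">
<div class="detail"> World
</div>
</div>
"""

pattern = re.compile(
    r'<div\sclass="results-list clearfix">\s*(?P<Content><div[^>]*>.*?</div>)',
    re.S,  # let "." match newlines, like RegexOptions.Singleline
)

contents = [m.group("Content") for m in pattern.finditer(x)]
for c in contents:
    print(repr(c))
```

The original [^<]* could not match because the content of the outer div starts with another tag; capturing the inner <div>...</div> explicitly avoids that.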

We Keep Coding
