What is wrong with my regular expression? - regex

How can I go about getting the following values, e.g.
<div class="detail"> Hello </div>
<div class="detail"> World </div>
string x = @" <div class=""results-list clearfix"">
<div class=""detail""> Hello
</div>
</div>
<div class=""results-list clearfix"">
<div class=""detail""> World
</div>
</div>
";
String pattern = @"<div class=""results-list clearfix"">(?<Content>[^<]*)</div>";
Regex rx = new Regex(pattern,RegexOptions.Multiline);
Match m = rx.Match(x);
while (m.Success)
{
string zz = m.Groups["Content"].Value;
m = m.NextMatch();
}

I think this is your problem ""results-list clearfix"". As you are using a literal string, you can remove the extra "'s.

It is a bad idea to use regular expressions for this kind of parsing. Use an XML parser for this particular scenario. I suggest LINQ to XML, i.e. XElement.Parse(...).
Do not forget to wrap your HTML in a single root element, though.

Try this pattern with the Singleline option (RegexOptions.Singleline):
string pattern = "<div\\sclass=\"results-list clearfix\">\\s*(?<Content><div[^>]*>.*?</div>)";
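
For comparison, here is a minimal Python sketch of the same pattern, assuming the sample markup from the question. re.DOTALL plays the role of .NET's Singleline option, and the tag-stripping step at the end is my own addition to get at the bare values; it is only reasonable for flat content like this.

```python
import re

html = '''<div class="results-list clearfix">
<div class="detail"> Hello
</div>
</div>
<div class="results-list clearfix">
<div class="detail"> World
</div>
</div>'''

# re.DOTALL lets "." match newlines, like RegexOptions.Singleline in .NET.
pattern = re.compile(
    r'<div\sclass="results-list clearfix">\s*(?P<Content><div[^>]*>.*?</div>)',
    re.DOTALL,
)

# Each match captures the whole inner <div class="detail"> element.
contents = [m.group("Content") for m in pattern.finditer(html)]

# Strip the tags and surrounding whitespace to keep only the text.
values = [re.sub(r"<[^>]+>", "", c).strip() for c in contents]
print(values)
```

The lazy .*? stops at the first </div>, so this only works while the detail div has no nested divs.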

How to find hashtag in HTML which ends with space or <?

I want to find hashtags in HTML by Regex.
<div class="details">
<p>#hashtag1 #hashtag2 #hashtag3</p>
<p>#hashtag4 #hashtag5 #hashtag6</p>
...
</div>
When I use this Regex:
var details = $('.details').html();
details = details.replace(/#(\S*)/g, '<a href="/?hashtag=$1">#$1</a>');
$('.details').html(details);
It returns:
<div class="details">
<p>
<a href="/?hashtag=hashtag1>#hashtag1</a>
<a href="/?hashtag=hashtag2>#hashtag2</a>
<a href="/?hashtag=hashtag3>#hashtag3</p></a>
...
</div>
How can I get this?
<div class="details">
<p>
<a href="/?hashtag=hashtag1>#hashtag1</a>
<a href="/?hashtag=hashtag2>#hashtag2</a>
<a href="/?hashtag=hashtag3>#hashtag3</a>
</p>
...
</div>
You can use this pattern /#\w+/g
var input = '<p>#hashtag1 #hashtag2 #hashtag3</p>';
console.log(input.match(/#\w+/g));
\w matches a word character: a-z, A-Z, 0-9, or the _ (underscore) character.
Update:
var details = $('.details').html();
details = details.replace(/#(\w+)/g, '<a href="/?hashtag=$1">#$1</a>');
$('.details').html(details);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div class="details">
<p>#hashtag1 #hashtag2 #hashtag3</p>
</div>
Update 2:
var details = $('.details').html();
var pattern = /#([\p{L}\p{N}]+)/gu;
details = details.replace(pattern, '<a href="/?hashtag=$1">#$1</a>');
$('.details').html(details);
// for testing
console.log(details.match(pattern));
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div class="details">
<p>#nguyễn1 #nhựt2 #tân3</p>
</div>
@Tân gave me exactly the answer I wanted, but I found another way to solve this problem with jQuery.
If I replace the JavaScript (jQuery) part like this:
var $targets = $('.details p');
$targets.each(function () {
  // .text() returns the content without the surrounding <p> tags,
  // so \S* can no longer swallow the closing </p>.
  var text = $(this).text();
  text = text.replace(/#(\S*)/g, '<a href="/?hashtag=$1">#$1</a>');
  $(this).html(text);
});
This solves the problem.
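
To see the difference between the two patterns outside the browser, here is a small Python sketch. The link format in the replacement string is an assumption based on the desired output shown in the question.

```python
import re

html = "<p>#hashtag1 #hashtag2 #hashtag3</p>"

# Assumed link format, modelled on the desired output in the question.
link = r'<a href="/?hashtag=\1">#\1</a>'

# \S* runs until the next whitespace, so the last tag swallows "</p>".
greedy = re.sub(r"#(\S*)", link, html)

# \w+ stops at the first non-word character, which includes "<".
fixed = re.sub(r"#(\w+)", link, html)

print(greedy)
print(fixed)
```

With \w+ the closing </p> stays outside the link, which is the behaviour the question asks for.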

Why doesn't this regexp work for this HTML?

<div class="_1zGQT _2ugFP message-in">
<div class="-N6Gq">
<div class="copyable-text" data-pre-plain-text="[18:09, 3.6.2019] Лера сестра: ">
<div class="_12pGw">
<div class="_3X58t selectable-text invisible-space copyable-text">
<span class="_2ZDCk">
<img crossorigin="anonymous" src="URL" alt="😆" draggable="false" class="_298rb _2FANH selectable-text invisible-space copyable-text" data-plain-text="😆" style="visibility: visible;">
</span>
</div>
</div>
</div>
</div>
</div>
I've tried to get it with this code:
soup.find('div', class_=re.compile('^selectable-text invisible-space copyable-text'))
All I got: None.
The problem is that part of the class (_3X58t ) is changing.
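This is likely due to the ^ anchor: the class attribute starts with the changing part (_3X58t), so a pattern anchored at the start cannot match. We could drop the anchor: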
soup.find('div', class_=re.compile('selectable-text invisible-space copyable-text'))
or we might try this expression for the divs:
(.+?selectable-text invisible-space copyable-text)
I would first see whether a single class from the compound class list could be used, e.g.
soup.select_one('.selectable-text')
Otherwise, combine the classes:
soup.select_one('[class$="selectable-text invisible-space copyable-text"]')
rather than resorting to regex.
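
The idea of matching on the stable classes and ignoring the changing one can also be sketched with only the standard library. This is a stand-in for the Beautiful Soup calls above, not a replacement for them; the ClassFinder name and the trimmed-down markup are my own.

```python
from html.parser import HTMLParser

# The stable classes; the changing one (_3X58t) is deliberately left out.
REQUIRED = {"selectable-text", "invisible-space", "copyable-text"}


class ClassFinder(HTMLParser):
    """Collect the class lists of <div> tags that carry every required class."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Set comparison ignores class order and any extra classes.
        if REQUIRED <= set(classes):
            self.hits.append(classes)


html = '''<div class="_12pGw">
<div class="_3X58t selectable-text invisible-space copyable-text">
<span class="_2ZDCk"></span>
</div>
</div>'''

finder = ClassFinder()
finder.feed(html)
print(finder.hits)
```

Comparing sets of classes means neither the order of the classes nor the changing prefix matters. In Beautiful Soup itself, a selector like div.selectable-text.invisible-space.copyable-text expresses the same condition.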

How to make a non-greedy regex for the following?

I have something like this:
...
<div class="viewport viewport_h" style = "overflow: hidden;" >
<div id="THIS" class="overview overview_h">
<ul>
<li>some txt to be captured</li>
<li>some txt to be captured</li>
<li>some txt to be captured</li>
</ul>
<div>
" some text to be captured"
</div>
</div>
</div>
"some text not to be captured"
</div>
<div class="scrollbar_h">
<div class="track_h"></div>
...
I want to capture everything inside the div with id=THIS. I'm using something like:
#<div class="viewport viewport_h" style = "overflow: hidden;" >\s*<div class="overview overview_h">\s*(?:<ul>)?([\s\d\w<>\/()="-:;‘’!,:]+)(?:</div>)+?#
The last (?:</div>)+? is meant to make it non-greedy for further "</div>", but that doesn't work and it captures all the following </div> tags. :(
As noted in the comments, regex is not a proper tool for parsing (?:X|H)TML documents.
Still, for your example, one straightforward attempt is the following regex:
<div[^>]*id="THIS"[^>]*>(.*?)</div>
That will match the following text:
<ul>
<li>some txt to be captured</li>
<li>some txt to be captured</li>
<li>some txt to be captured</li>
</ul>
<div>
" some text to be captured"
</div>
As you can see, this is not the correct result, because you need one more </div>. To detect the right closing tag you have to count the open divs,
and how you do that depends on the language you are using.
In this case, if you want a non-greedy match that extends past the first closing div, you need to put a dot before the + as follows:
<div[^>]*id="THIS"[^>]*>(.*?)(</div>).+?
Now it will match another </div>, but it is still hard for a regex to find the correct result (and it gets more complicated in other situations). That is why the proper way to parse (?:X|H)TML is to use an (?:X|H)TML parser.
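
The "count the open divs" idea mentioned above can be sketched in Python roughly like this. The function name inner_html and the shortened sample are my own, and the sketch ignores comments, scripts and other places where a stray <div> could appear, so treat it as an illustration rather than a parser.

```python
import re


def inner_html(html, div_id):
    """Return the inner HTML of the <div> with the given id, or None."""
    start = re.search(r'<div\b[^>]*\bid="%s"[^>]*>' % re.escape(div_id), html)
    if start is None:
        return None
    rest = html[start.end():]
    depth = 1  # we are inside the target div
    for tag in re.finditer(r"<(/?)div\b[^>]*>", rest):
        # A closing tag lowers the depth, an opening tag raises it.
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            return rest[:tag.start()]
    return None  # unbalanced markup


sample = '''<div class="viewport viewport_h">
<div id="THIS" class="overview overview_h">
<ul>
<li>some txt to be captured</li>
</ul>
<div>
" some text to be captured"
</div>
</div>
"some text not to be captured"
</div>'''

result = inner_html(sample, "THIS")
print(result)
```

The nested <div> raises the depth to 2, so its closing tag does not end the capture; only the tag that brings the depth back to 0 does.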

Perl regexp to find an element inside an element

I need a regular expression that matches from <div id="class1"> to its closing </div>. There may be any number of <div> elements nested inside it. Here is a sample:
This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example
I have tried the pattern below, but it only matches up to the first </div>, the one that closes <div id="subclass1">.
Could anyone help me solve this?
The pattern I tried is:
<div id="class1">(?:(?!<\/div>).)*?</div>
Use a proper HTML parser.
use strict;
use warnings;
use feature qw( say );
use XML::LibXML qw( );
my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_string($html);
my $root = $doc->documentElement();
for my $div ($root->findnodes('//div[@id="class1"]')) {
say "[", $div->toString(), "]";
}
$ echo 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example' | sed -n 's/<div id="class1">\(.*\)<\/div>/\1/p'
This is example This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is This is example
You should use an appropriate HTML/XML parser. If you want to do it with a regex for some reason, a nested (recursive) regex can help. (Check perldoc perlre for details.)
$re = qr{
(
<div[^>]*>
(?:(??{$re}) | [^<>]*)*
</div>
)
}x;
print "$1\n" if(/$re/o);
A lot of people always say "Use a proper HTML parser" to parse HTML and not regex. What some people fail to realize is that there are requirements to be met and those requirements might require regex.
<div id=".+?">.*</div> should work for you.
http://regexr.com?33336
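
A quick Python check of how the greedy pattern above and the pattern from the question behave on the sample. The extra trailing div in the second test string is a hypothetical addition of mine, there to show where the greedy version overshoots.

```python
import re

sample = (
    'This is example <div id="class1">This is <div id="subclass1">This is </div> '
    '<div id="subclass2">This is </div> This is </div> This is example'
)

# The pattern from the question: stops at the first </div> it meets.
tempered = re.compile(r'<div id="class1">(?:(?!</div>).)*?</div>')
# The greedy pattern from the last answer: runs to the LAST </div>.
greedy = re.compile(r'<div id=".+?">.*</div>')

first_close = tempered.search(sample).group(0)
whole_div = greedy.search(sample).group(0)

# Hypothetical extra markup after the target element.
longer = sample + ' <div id="other">x</div>'
overshoot = greedy.search(longer).group(0)

print(first_close)
print(whole_div)
print(overshoot)
```

So the greedy version is fine for this exact input, but it keeps going as soon as another </div> follows later on the same line.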

Regular expression negates the expression

I'm using the PCRE regexp engine, and I have a string that looks like this:
<h3 class="description">Description</h3> <div class="wrapper"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div> </div>
and a regexp that works fine and captures the string "dddsome string blah blahddssssseeeee".
It looks like this:
<\s*h3\s*class="*.+?"\s*>.*?</\s*h3>.+?<\s*div.+?class\s*="wrapper"\s*>(.+?)<\s*div\s*class="empty">
Sometimes I get almost the same string, but with an extra <div class="aplus"> tag, as shown below. When this tag appears, I want the regexp above to fail to match the whole string.
<h3 class="description">Description</h3> <div class="wrapper"> <div class="aplus"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div>
Try this:
<div.*>(.*)<div.*>
But consider using Beautiful Soup for easier, better web scraping.
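
One way to make the match fail when the aplus div is present is a negative lookahead right after the wrapper tag. Below is a sketch in Python's re module, with a slightly tightened version of the pattern from the question (the tightening is my own); the lookahead syntax (?!...) is also available in PCRE.

```python
import re

pattern = re.compile(
    r'<h3\s+class="[^"]*"\s*>.*?</h3>\s*'
    r'<div\s+class="wrapper"\s*>'
    # Refuse to continue if the next tag is the aplus div. The \s* sits
    # inside the lookahead so backtracking over spaces cannot bypass it.
    r'(?!\s*<div\s+class="aplus")'
    r'\s*(.+?)<div\s+class="empty">',
    re.DOTALL,
)

plain = ('<h3 class="description">Description</h3> <div class="wrapper"> '
         'dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div> </div>')
aplus = ('<h3 class="description">Description</h3> <div class="wrapper"> <div class="aplus"> '
         'dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div>')

plain_match = pattern.search(plain)
aplus_match = pattern.search(aplus)

print(plain_match.group(1).strip())
print(aplus_match)
```

The first string still yields the captured text, while the second one produces no match at all.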