Regular expression negates the expression - regex

I'm using the PCRE regex engine, and I have a string that looks like this:
<h3 class="description">Description</h3> <div class="wrapper"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div> </div>
and a regexp that works fine and captures the string "dddsome string blah blahddssssseeeee".
It looks like this:
<\s*h3\s*class="*.+?"\s*>.*?</\s*h3>.+?<\s*div.+?class\s*="wrapper"\s*>(.+?)<\s*div\s*class="empty">
Sometimes I get almost the same string, like the one below (note the div class="aplus" tag). When this tag appears, I want the regexp above to fail to match the whole string.
<h3 class="description">Description</h3> <div class="wrapper"> <div class="aplus"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div>

Try this:
<div.*>(.*)<div.*>
That said, Beautiful Soup makes this kind of web scraping easier and more robust.
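One possible way to get the "fail when aplus is present" behaviour is a negative lookahead placed right after the wrapper's opening tag. The sketch below uses Python's re module rather than PCRE (the lookahead syntax is the same in both); the names PATTERN and extract_description are made up for illustration, and the lookahead only rejects an aplus div that directly follows the wrapper tag, as in the sample.

```python
import re

# The asker's pattern with one addition: a negative lookahead after the
# wrapper's opening tag that rejects an immediately following aplus div.
PATTERN = re.compile(
    r'<\s*h3\s*class="*.+?"\s*>.*?</\s*h3>.+?'
    r'<\s*div.+?class\s*="wrapper"\s*>'
    r'(?!\s*<\s*div\s+class="aplus")'
    r'(.+?)<\s*div\s*class="empty">'
)

def extract_description(html):
    """Return the captured text, or None when the pattern does not match."""
    m = PATTERN.search(html)
    return m.group(1).strip() if m else None

plain = ('<h3 class="description">Description</h3> <div class="wrapper"> '
         'dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div> </div>')
aplus = ('<h3 class="description">Description</h3> <div class="wrapper"> '
         '<div class="aplus"> dddsome string blah blahddssssseeeee <div class="empty"> </div></div> </div>')

print(extract_description(plain))
print(extract_description(aplus))
```

The first call should return the description text and the second should return None, because the lookahead blocks the only position where the wrapper tag can match.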

Related

Why doesn't this regexp work for this HTML?

<div class="_1zGQT _2ugFP message-in">
<div class="-N6Gq">
<div class="copyable-text" data-pre-plain-text="[18:09, 3.6.2019] Лера сестра: ">
<div class="_12pGw">
<div class="_3X58t selectable-text invisible-space copyable-text">
<span class="_2ZDCk">
<img crossorigin="anonymous" src="URL" alt="😆" draggable="false" class="_298rb _2FANH selectable-text invisible-space copyable-text" data-plain-text="😆" style="visibility: visible;">
</span>
</div>
</div>
</div>
</div>
</div>
I tried to get the div with this code:
soup.find('div', class_=re.compile('^selectable-text invisible-space copyable-text'))
All I got was None.
The problem is that part of the class (_3X58t) keeps changing.
This is likely due to the ^ anchor, which we could drop:
soup.find('div', class_=re.compile('selectable-text invisible-space copyable-text'))
or we might try this expression for the divs:
(.+?selectable-text invisible-space copyable-text)
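The effect of the ^ anchor can be checked with plain re, without Beautiful Soup. This is only a sketch: it assumes the pattern ends up being tested against the full class attribute string, and the variable names are made up for illustration.

```python
import re

# Full class attribute of the target div, including the changing prefix.
class_attr = "_3X58t selectable-text invisible-space copyable-text"

anchored = re.compile(r"^selectable-text invisible-space copyable-text")
unanchored = re.compile(r"selectable-text invisible-space copyable-text")

# ^ forces the match to begin at position 0, where "_3X58t" sits.
anchored_hit = anchored.search(class_attr)
# Without ^ the pattern may match anywhere in the string.
unanchored_hit = unanchored.search(class_attr)

print(anchored_hit is None, unanchored_hit is not None)
```

Because the changing class comes first in the attribute, anchoring at the start can never succeed here, while the unanchored pattern finds the stable part.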
Rather than resorting to regex, I would first see whether a single class from the compound class list could be used, e.g.
soup.select_one('.selectable-text')
Otherwise, combine the classes:
soup.select_one('[class$="selectable-text invisible-space copyable-text"]')

Sublime Text Regex Search for alphanumeric string, not working..

I'm trying to replace a common theme used in hundreds of pages in my project:
<div id="PageTitle"> (Page title as a string) </div>
The title varies on each page. I want to replace it with
<div class="row">
<div class="col-md-12 col-sm-12">
<h3><?= $pageTitle?></h3>
</div>
</div>
I've tried searching with <div id="PageTitle">/^\w+$/</div>, and <div id="PageTitle">"^[a-zA-Z0-9_]*$"</div> with no luck. Any ideas?
You are almost there. Looks like you got the pattern from somewhere else. ^ and $ are start and end anchors, so they only match at the start and end of the input; you should get rid of them here.
Next, if your page title only ever contains alphanumeric characters (and no spaces), then \w is fine; otherwise you might want to use . instead.
<div id="PageTitle">\w+<\/div>
For a title containing any character:
<div id="PageTitle">.+?<\/div>
Hope this helps!
Try this one as well; I think it's pretty strict:
<div id="PageTitle">(?:(?!<\/div>).)+<\/div>
Or even:
<div id="PageTitle">[\s\S]*?<\/div>
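For anyone who wants to script the replacement instead of using the editor's search box, here is a rough Python sketch built on the last pattern above. The names REPLACEMENT, PATTERN and replace_title are made up for illustration, and the sample page string is invented.

```python
import re

# New markup taken from the question.
REPLACEMENT = (
    '<div class="row">\n'
    '<div class="col-md-12 col-sm-12">\n'
    '<h3><?= $pageTitle?></h3>\n'
    '</div>\n'
    '</div>'
)

# [\s\S]*? lazily matches any characters, newlines included, up to the
# first closing </div>.
PATTERN = re.compile(r'<div id="PageTitle">[\s\S]*?</div>')

def replace_title(page):
    # A function replacement inserts REPLACEMENT literally, with no
    # backslash or group-reference processing.
    return PATTERN.sub(lambda _m: REPLACEMENT, page)

sample = '<body><div id="PageTitle"> My Page 1 </div><p>x</p></body>'
print(replace_title(sample))
```

The lazy quantifier matters: a greedy [\s\S]* would run to the last </div> on the page and swallow everything in between.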

preg_replace regular expression to replace link within a particular tags

I need some help: I want to replace the href link with my own link, within a particular div class only.
<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb">
<b class="icon-star"></b> N/A
</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>
</div>
Here I want to change http://oldsite.com/ to http://newsite.com/?id=
so that the href links look like
<a href="http://newsite.com/?id=the-fate-of-the-furious">
Please help me with a preg_replace regular expression.
Thanks
This may help you:
$content = get_the_content();
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
$newurl = get_permalink();
$content = preg_replace($pattern,$newurl,$content);
echo $content;
Lookbehinds are needlessly expensive here; use \K to restart the full-string match and avoid a capture group:
<a href="\K[^"]+\/
This pattern is very efficient. Note that it matches ALL <a href URLs, and that it matches greedily up to the last / in the URL; I assume this is okay given your input sample.
Code:
$in='<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb"><b class="icon-star"></b> N/A</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>';
echo preg_replace('/<a href="\K[^"]+\//','http://newsite.com/?id=',$in);
Output:
<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb"><b class="icon-star"></b> N/A</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>
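For comparison, here is a rough Python sketch of the same replacement. Python's re module has no \K, so the sketch keeps the prefix in a capture group and writes it back. The input link is an assumption reconstructed from the question's description (the sample above contains no <a> tag), and rewrite_links is a made-up name.

```python
import re

# Keep '<a href="' in group 1, then consume greedily up to the last '/'
# before the closing quote.
LINK = re.compile(r'(<a href=")[^"]+/')

def rewrite_links(html):
    # \g<1> writes the captured prefix back in front of the new base URL.
    return LINK.sub(r'\g<1>http://newsite.com/?id=', html)

# Hypothetical input, not taken from the original sample.
old = '<a href="http://oldsite.com/the-fate-of-the-furious">'
print(rewrite_links(old))
```

Like the PHP pattern, this rewrites every <a href on the page, so it would still need narrowing to the target div if other links must stay untouched.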

How can I extract URLs from html content with ruby regexp?

Let's go straight to an example, since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
From the content above, I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the strings "f6a1ok3n4d4p" and "site2.com", then combine them into
http://site2.com/f6a1ok3n4d4p
and the same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need this to be done with a Ruby regex.
This should give you some insight into how to do it:
https://regex101.com/r/wD4oT8/2
javascript:show\(\'(.*?)'.*?\'([^\']*)\'\) captures the first argument as $1 and the last quoted argument as $2, so you get what you want by substituting $2/$1.
That's the regex part of it. Of course, you can adjust the regex as you see fit, for example to also accept " as the quote character (javascript:show\((?:\'|\")(.*?)(?:\'|\").*?(?:\'|\")([^\'\"]*)(?:\'|\")\)) or to allow only calls with 3 arguments.
/yourregex/.match(yourstring) will extract the information you need.
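To show the capture-and-reorder step end to end, here is a small sketch of the first pattern. It is written in Python rather than Ruby, purely as an illustration; the regex itself should carry over to Ruby unchanged. SHOW_CALL and extract_urls are made-up names.

```python
import re

# Group 1: first quoted argument (the id).
# Group 2: last quoted argument before the closing parenthesis (the host).
SHOW_CALL = re.compile(r"javascript:show\('(.*?)'.*?'([^']*)'\)")

def extract_urls(html):
    # findall yields (id, host) tuples; swap them into host/id order.
    return [f"http://{host}/{ident}" for ident, host in SHOW_CALL.findall(html)]

snippet = ("<a href=\"javascript:show('zsgn82c4b96d',"
           "'random%20strings%204',%20'site1.com');%20\">")
print(extract_urls(snippet))
```

The lazy .*? between the two groups is what skips the middle argument, whatever it contains, as long as it has no quote directly followed by a closing parenthesis.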

What is wrong with my regular expression?

How can I go about getting the values, e.g.
<div class="detail"> Hello </div>
<div class="detail"> World </div>
string x = @" <div class=""results-list clearfix"">
<div class=""detail""> Hello
</div>
</div>
<div class=""results-list clearfix"">
<div class=""detail""> World
</div>
</div>
";
string pattern = @"<div class=""results-list clearfix"">(?<Content>[^<]*)</div>";
Regex rx = new Regex(pattern,RegexOptions.Multiline);
Match m = rx.Match(x);
while (m.Success)
{
string zz = m.Groups["Content"].Value;
m = m.NextMatch();
}
I think this is your problem ""results-list clearfix"". As you are using a literal string, you can remove the extra "'s.
It is a bad idea to use regular expressions for this kind of parsing. Use an XML parser for this particular scenario. I suggest LINQ to XML, i.e. XElement.Parse(...)
Do not forget to wrap your HTML in a single root element, though.
Try this pattern with the Singleline option:
string pattern = "<div\\sclass=\"results-list clearfix\">\\s*(?<Content><div[^>]*>.*?</div>)";
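The last pattern can be tried out quickly outside C#. Below is a rough Python equivalent, offered only as a sketch: re.DOTALL plays the role of the Singleline option, Python spells the named group (?P<Content>...), and inner_divs is a made-up helper name.

```python
import re

html = """ <div class="results-list clearfix">
<div class="detail"> Hello
</div>
</div>
<div class="results-list clearfix">
<div class="detail"> World
</div>
</div>
"""

# DOTALL lets . cross newlines, so .*? can reach the inner closing tag.
PATTERN = re.compile(
    r'<div\sclass="results-list clearfix">\s*(?P<Content><div[^>]*>.*?</div>)',
    re.DOTALL,
)

def inner_divs(text):
    # Collapse runs of whitespace so each result fits on one line.
    return [" ".join(m.group("Content").split()) for m in PATTERN.finditer(text)]

print(inner_divs(html))
```

The original [^<]* group could never match here, since the content of the outer div starts with another tag; matching the inner div explicitly avoids that.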