Regular expression to remove lines with special characters - regex

<a class='jdr' href='javascript:void(0);' onClick="return openDiv('jrtp');"></a>
<span class="jcn">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET" title='Aptech N Power Hardware & Networking' >Aptech N Power Hardware & Networkin...</a>
</span>
<section class="jrat">
<a rel="nofollow" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw"><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s0'></span></a>
<a class="jrt" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw">2 ratings</a>
<span class="jrt"> |</span>
<a class="rate_this" onclick="_ct('ratethis','lspg');" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET/writereview">Rate this</a>
</section>
<section class="jcar">
<section class="jbc">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET">
<img width="83" height="56" border="0" src="http://images.jdmagicbox.com/upload_test/ahmedabad/b4/079pxx79.xx79.110420172948.d4b4/logo/faf3f2409ed7993aaa70f848ab0bb6fb_t.jpg" class="Clogo" />
</a>
<!-- <span class="noLogo"></span> -->
<section class="jrcl">
<p>
**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>
</p>
From the above markup I want to extract the following:
A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024
I need help creating a regular expression that finds and removes all lines containing special characters.
I am currently using the following regex:
/(\<.+?>)/g
Please help. Thanks.

Try this:
/(?<=\*{2})([^<>]*?)(?=\*{2})/g
It matches all content between the ** markers.

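I think you want to remove the lines that consist of HTML tags, so try this (the m flag is needed so that ^ matches at the start of every line, not only at the start of the input):
/^<.*>\n/gm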
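For comparison, here is a minimal Python sketch of both suggestions above: pulling out the text between the ** markers, and dropping lines that contain nothing but tags. The sample string is a shortened copy of the question's markup, and the variable names are only illustrative.

```python
import re

# Shortened copy of the markup from the question.
sample = """<section class="jrcl">
<p>
**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>
</p>"""

# Text between the ** markers; the lookarounds keep the asterisks out of the match.
addresses = re.findall(r"(?<=\*{2})[^<>*]+(?=\*{2})", sample)

# Alternative: drop every line that consists solely of a tag. re.MULTILINE
# makes ^ and $ anchor at each line instead of only at the ends of the string.
remaining = re.sub(r"^[ \t]*<.*>[ \t]*$\n?", "", sample, flags=re.MULTILINE)

print(addresses)
print(remaining)
```

The first approach is usually the safer one here, because the address line itself ends in a tag (<br>) and would need extra cleanup after the line-removal approach.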

Related

preg_replace regular expression to replace link within a particular tags

I need some help: I want to replace the href links with my own link, but only within a particular div class.
<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb">
<b class="icon-star"></b> N/A
</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>
</div>
Here I want to change http://oldsite.com/ to http://newsite.com/?id=
so that the href links look like this:
<a href="http://newsite.com/?id=the-fate-of-the-furious">
Please help me with a preg_replace regular expression.
Thanks
This may help you:
$content = get_the_content();
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
$newurl = get_permalink();
$content = preg_replace($pattern,$newurl,$content);
echo $content;
Lookbehinds are comparatively expensive; use \K to restart the full-string match and avoid a capture group.
<a href="\K[^"]+\/
This pattern will be very efficient. I should state that it will match ALL <a href URLs. It also matches greedily up to the last / in the URL; I assume this is okay given your input sample.
Code:
$in='<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb"><b class="icon-star"></b> N/A</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>';
echo preg_replace('/<a href="\K[^"]+\//','http://newsite.com/?id=',$in);
Output:
<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb"><b class="icon-star"></b> N/A</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>
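Python's built-in re module has no \K, so the same idea needs a capture group for the prefix there. The sketch below is only an illustration under one assumption: the sample in the question contains no <a href=...> tag, so the anchor used here is hypothetical.

```python
import re

# Hypothetical anchor tag: the sample in the question contains no <a href=...>,
# so this input is an assumption made purely for illustration.
html = '<a href="http://oldsite.com/the-fate-of-the-furious">The Fate of the Furious</a>'

# Group 1 keeps the literal '<a href="' prefix. [^"]+/ then greedily consumes
# the URL up to its last slash, and that part is what gets replaced.
pattern = re.compile(r'(<a href=")[^"]+/')

result = pattern.sub(r'\g<1>http://newsite.com/?id=', html)
print(result)
```

As with the PHP pattern, this rewrites every <a href="..."> in the input, not only those inside a particular div.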

Regex: removing specific style attribute in HTML

I have a problem with a regex.
I want to remove the background-image declaration from the style attribute of HTML tags like these:
<span style="background-image: url ("http://mantis.we.intern/custom/userfiles/image/6y0eC4vzptnIxsikHs0AJA.png"); bgcolor:'red'">11111<br />
<span style="background-image: url('https:// asd asdmantis.we.intern/custom/userfiles/image/6y0eC4vzptnIxsikHs0AJA.png')">22222
<span style="background-image:url('https://fmantis.we.intern/custom/userfiles/image/6y0eC4vzptnIxsikHs0AJA.png')">3333
<span style="background-image:url( 'https://fmantis.we.intern/custom/userfiles/image/6y0eC4vzptnIxsikHs0AJA.png')">444
<span style="background-image: url ( "http://mantis.we.intern/custom/userfiles/image/6y0eC4vzptnIxsikHs0AJA.png");">555<br />
<span style="background-image: url(xx https://trk.workexpert.net/web/include/ckeditor_432/plugins/icons.png?t=E0LB& xx); >666
The result after applying the regex should look like this:
<span style=" bgcolor:'red'">11111<br />
<span style="">22222
<span style="">3333
<span style="">444
<span style="">555<br />
<span style=" >666
Thank you in advance.
This regex does what you need:
background-image:[^)]*[)];?
It matches the text background-image: and then everything up to and including the first ), followed by an optional ;
Tested on Regex101.
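A quick way to check the pattern is a small Python sketch like the following. It runs the regex over two of the sample lines copied from the question; the variable names are illustrative only.

```python
import re

# Two of the sample lines from the question, copied verbatim.
samples = [
    """<span style="background-image: url ("http://mantis.we.intern/custom/userfiles/image/6y0eC4vzptnIxsikHs0AJA.png"); bgcolor:'red'">11111<br />""",
    """<span style="background-image:url('https://fmantis.we.intern/custom/userfiles/image/6y0eC4vzptnIxsikHs0AJA.png')">3333""",
]

# background-image: followed by anything up to the first ")" and an optional ";".
pattern = re.compile(r"background-image:[^)]*\);?")

cleaned = [pattern.sub("", line) for line in samples]
for line in cleaned:
    print(line)
```

One caveat: because the match stops at the first ), a URL that itself contains a closing parenthesis would be cut short.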

regex to remove recurring instances of comment tag

Hello, I want to remove every instance of the comment tag that occurs in my data.
The data I am using is shown below:
<!-- <li><a class="topitemlink" href="/About-Us/Career-Centre.aspx">Career Centre</a></li>
<li><img alt="" width="7" height="22" src="/images/common/separator.gif" /></li>-->
<li><a class="topitemlink" href="/ContactUs">Contact Us</a> <!-- <ul class="topcontactusmenu"><li>Contact Us</li><li>Contact the IR Team</li><li>Contact the Media Team</li></ul> --></li>
</ul>
</div>
<!--<img width="92" height="40" src="/ABMB/media/MyLibrary/Shared/Images/bizSmart_logo.gif" alt="" /><img width="76" height="40" src="/ABMB/media/MyLibrary/Shared/Images/sabah-run2015_top-icon.jpg" alt="" />-->
The regex I am using captures only the first instance, but I want all instances to be captured.
<!--.*\s.*-->
You could use something like this: <!--.+?-->. Make sure that you have the s and g flags enabled.
The s flag allows the period character to also match newlines, so that comments spanning multiple lines are captured.
The g flag applies the pattern globally, that is, to every match in the text rather than only the first.
You didn't specify the language you're using, but in PHP you can use /<!--.*?-->/s, e.g.:
$html = '<!-- <li><a class="topitemlink" href="/About-Us/Career-Centre.aspx">Career Centre</a></li>
<li><img alt="" width="7" height="22" src="/images/common/separator.gif" /></li>-->
<li><a class="topitemlink" href="/ContactUs">Contact Us</a> <!-- <ul class="topcontactusmenu"><li>Contact Us</li><li>Contact the IR Team</li><li>Contact the Media Team</li></ul> --></li>
</ul>
</div>
<!--<img width="92" height="40" src="/ABMB/media/MyLibrary/Shared/Images/bizSmart_logo.gif" alt="" /><img width="76" height="40" src="/ABMB/media/MyLibrary/Shared/Images/sabah-run2015_top-icon.jpg" alt="" />-->';
$html = preg_replace('/<!--.*?-->/s', '', $html);
echo $html;
/*<li><a class="topitemlink" href="/ContactUs">Contact Us</a> </li>
</ul>
</div>*/
DEMO:
https://ideone.com/It6HvW
EXPLANATION:
<!--.*?-->
Options: Case sensitive; Exact spacing; Dot matches line breaks; ^$ don’t match at line breaks; Greedy quantifiers; Regex syntax only
Match the character string “<!--” literally «<!--»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character string “-->” literally «-->»
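The same lazy, dot-matches-newline idea can be sketched in Python, where re.DOTALL plays the role of the s flag and re.sub replaces every match by default. The HTML below is a shortened, simplified version of the question's data, not the original.

```python
import re

# Shortened version of the data from the question.
html = """<!-- <li>Career Centre</li>
<li>separator</li>-->
<li><a href="/ContactUs">Contact Us</a> <!-- <ul><li>Contact the IR Team</li></ul> --></li>
<!--<img src="/images/logo.gif" alt="" />-->"""

# Lazy .*? stops at the nearest "-->", so each comment is removed separately.
without_comments = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

# Greedy .* runs from the first "<!--" to the last "-->" and swallows the
# content between the comments as well.
greedy_result = re.sub(r"<!--.*-->", "", html, flags=re.DOTALL)

print(without_comments)
print(repr(greedy_result))
```

Comparing the two results shows why the lazy quantifier matters: the greedy version deletes the Contact Us item along with the comments.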

BeautifulSoup not able to parse perfectly

When I use soup.find("h3", text="Main Address:").find_parents("section"), I get this output:
[<section class="otlnrw" itemscope="" itemtype="http://microformats.org/wiki/hCard">\n<header>\n<h3 itemprop="name">Main Address:</h3>\n</header>\n<p>600 Dexter <abbr title="Avenue\r"><abbr title="Avenue\r">Ave.</abbr></abbr><br/><span class="locality">Montgomery</span>, <span class="region">AL</span>, <span class="postal-code">36104</span></p> </section>]
Now I want to print only the paragraph's text, but I am not able to do that. Please tell me how I can print only the text inside the paragraph of this section.
Or my HTML page is like this:
<article>
<header>
<h2 id="state-government">State Government</h2>
</header>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Official Name:</h3></header>
<p>Alaska
</p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
<header><h3 class="org">Governor:</h3></header>
<p>Bill Walker</p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Main Address:</h3></header>
<p>120 East 4th Street<br>
<span class="locality">Juneau</span>,
<span class="region">AK</span>,
<span class="postal-code">99801</span></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
<header><h3 itemprop="name">Phone Number:</h3></header>
<p class="spk tel">907-465-3708</p>
</section>
<p class="volver clearfix"><a href="#skiptarget">
<span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
<section>
<header><h2 id="state-agencies">State Agencies</h2></header>
<ul>
<li>Consumer Protection Offices</li>
<li>Corrections Department</li>
<li>Election Office</li>
<li>Motor Vehicle Offices</li>
<li>Surplus Property Sales</li>
<li>Travel and Tourism</li>
</ul>
</section>
<p class="volver clearfix"><a href="#skiptarget">
<span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
</article>
How can I get only the text of the address from it?
Your current code returns a list with one element. To get the <p> element in it, you can expand it a bit:
soup.find("h3", text="Main Address:").find_parents("section")[0]("p")
If you want to get what is inside that p element, you'll have to get the first element of that list again, and run decode_contents on it:
soup.find("h3", text="Main Address:").find_parents("section")[0]("p")[0].decode_contents(formatter="html")
In your case that will return:
u'120 East 4th Street<br/><span class="locality">Juneau</span>, <span class="region">AK</span>, <span class="postal-code">99801</span>'
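Note that decode_contents returns the inner HTML, tags included. If only the plain text is wanted, get_text() may be closer to what the question asks for. A minimal sketch, assuming bs4 is installed and using a trimmed copy of the "Main Address" section from the question (string= is the newer spelling of the text= argument):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Main Address" section from the question.
html = """
<section class="otln">
<header><h3 itemprop="name">Main Address:</h3></header>
<p>120 East 4th Street<br>
<span class="locality">Juneau</span>,
<span class="region">AK</span>,
<span class="postal-code">99801</span></p>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading, climb to its enclosing <section>, then take its <p>.
heading = soup.find("h3", string="Main Address:")
paragraph = heading.find_parent("section").find("p")

# get_text() drops the tags; split/join collapses the newlines into spaces.
address = " ".join(paragraph.get_text().split())
print(address)
```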

How can I extract URLs from html content with ruby regexp?

Let's go directly to an example, since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
From the above content I want to extract, from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the strings "f6a1ok3n4d4p" and "site2.com", and then turn them into
http://site2.com/f6a1ok3n4d4p
and likewise for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need this done with a Ruby regex.
This should give you some insight into how to do it.
https://regex101.com/r/wD4oT8/2
javascript:show\(\'(.*?)'.*?\'([^\']*)\'\) will capture the first argument as $1 and the last quoted part as $2, so you get what you want by substituting http://$2/$1.
That's the regex part of it and, of course, you can adjust the regex as you see fit, for example to also accept " as the quote character (javascript:show\((?:\'|\")(.*?)(?:\'|\").*?(?:\'|\")([^\'\"]*)(?:\'|\")\)) or to allow only calls with 3 arguments.
/yourregex/.match(yourstring) will extract the information you need.
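The question asks for Ruby, but the pattern itself is portable. As a rough cross-check, here is the same regex in a small Python sketch that builds the final URLs. The input is a hypothetical string assembled from the two javascript:show(...) calls quoted in the question, not the original page.

```python
import re

# Hypothetical input assembled from the two calls quoted in the question.
html = (
    "<a href=\"javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20\">x</a> "
    "<a href=\"javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com');\">y</a>"
)

# Group 1: first quoted argument (the id). Group 2: last quoted argument (the host).
pattern = re.compile(r"javascript:show\('(.*?)'.*?'([^']*)'\)")

urls = [f"http://{host}/{ident}" for ident, host in pattern.findall(html)]
print(urls)
```

In Ruby, String#scan with the same pattern returns the two captures for each match, so the URL assembly should carry over with little change.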