How can I extract URLs from HTML content with a Ruby regexp?

Let's go directly to an example, since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract, from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the strings "f6a1ok3n4d4p" and "site2.com", then combine them as
http://site2.com/f6a1ok3n4d4p
and the same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need this to be done with a Ruby regex.

This should give you some insight into how to do it:
https://regex101.com/r/wD4oT8/2
javascript:show\(\'(.*?)'.*?\'([^\']*)\'\) captures the first argument as $1 and the last quoted argument as $2, so you get what you want by building http://$2/$1 from the two groups.
That's the regex part of it. Of course, you can adjust the regex as you see fit, for example to also accept double quotes (javascript:show\((?:\'|\")(.*?)(?:\'|\").*?(?:\'|\")([^\'\"]*)(?:\'|\")\)) or to allow only calls with 3 arguments.
/yourregex/.match(yourstring) will extract the information you need from the first match; yourstring.scan(/yourregex/) collects every match.
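As a rough illustration of the capture-and-rebuild idea, here is a minimal Python sketch of the same pattern (Python is used only for illustration; the question asks for Ruby, where the regex itself is unchanged). The helper name extract_urls and the sample string are assumptions made for this example, assembled from the two calls quoted in the question.

```python
import re

# Same pattern as in the answer: group 1 is the first quoted argument,
# group 2 is the last quoted argument before the closing parenthesis.
SHOW_CALL = re.compile(r"javascript:show\('(.*?)'.*?'([^']*)'\)")

def extract_urls(html):
    """Return http://<host>/<id> for every javascript:show(...) call found."""
    return [f"http://{host}/{link_id}" for link_id, host in SHOW_CALL.findall(html)]

# Hypothetical input built from the two calls quoted in the question.
sample = (
    "<a href=\"javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')\">"
    "<a href=\"javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20\">"
)

urls = extract_urls(sample)
print(urls)
```

findall returns one (id, host) tuple per call, which is why the two groups can be swapped into the URL in a single pass.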

Related

preg_replace regular expression to replace link within a particular tags

I need some help: I want to replace the href links with my own link, but only within a particular div class.
<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb">
<b class="icon-star"></b> N/A
</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>
</div>
Here I want to change http://oldsite.com/ to http://newsite.com/?id=
I want the href links to look like
<a href="http://newsite.com/?id=the-fate-of-the-furious">
Please help me with a preg_replace regular expression.
Thanks
This may help you:
$content = get_the_content();
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
$newurl = get_permalink();
$content = preg_replace($pattern,$newurl,$content);
echo $content;
Lookbehinds are more expensive than necessary here; use \K to restart the full-string match instead, which also avoids a capture group.
<a href="\K[^"]+\/
This pattern will be very efficient. I should state that it will match ALL <a href URLs. It also matches greedily up to the last / in the URL; I assume this is okay given your input sample.
Code:
$in='<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb"><b class="icon-star"></b> N/A</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>';
echo preg_replace('/<a href="\K[^"]+\//','http://newsite.com/?id=',$in);
Output:
<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb"><b class="icon-star"></b> N/A</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>
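For comparison, the same replacement can be sketched in Python. The standard re module has no \K, so this sketch captures the literal prefix and puts it back instead. The function name rewrite_links and the sample anchor are assumptions for illustration, based on the URLs mentioned in the question.

```python
import re

# re has no \K, so capture the literal '<a href="' prefix and re-emit it.
# [^"]+/ is greedy and therefore runs up to the last slash inside the URL.
HREF_PREFIX = re.compile(r'(<a href=")[^"]+/')

def rewrite_links(html, new_base="http://newsite.com/?id="):
    """Replace everything up to the last slash of each href with new_base."""
    return HREF_PREFIX.sub(lambda m: m.group(1) + new_base, html)

# Hypothetical anchor using the old URL form described in the question.
before = '<a href="http://oldsite.com/the-fate-of-the-furious">'
after = rewrite_links(before)
print(after)
```

Like the PHP pattern, this rewrites every <a href in the input; limiting it to one div would need the input to be narrowed to that div first.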

select element by driver.find_element_by_xpath used by selenium for span class

I'm trying to scrape data from a page and click on an element using XPath. For example, the content I want is in the following format:
<div class="x-grid-cell-inner x-grid-cell-inner-treecolumn" style="text-align:left;" unselectable="on">
<img class=" x-tree-elbow-img x-tree-elbow-empty" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==">
<img class=" x-tree-elbow-img x-tree-elbow-line" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==">
<img class=" x-tree-elbow-img x-tree-elbow" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==">
<img class=" x-tree-icon x-tree-icon-leaf " role="presentation" src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==">
<span class="x-tree-node-text ">Chassis</span>
I have used the expression //span[contains(@class, 'x-tree-node-text')].Chassis but it is not returning anything.
Any help?
Are you trying to match the text "Chassis" as well? It's hard to tell because you haven't put your XPath in backticks.
But if so, then your XPath is wrong. You have to use:
//span[contains(@class, 'x-tree-node-text')][.='Chassis']
Try the XPath below:
//span[@class='x-tree-node-text ']
or, more specifically:
//span[@class='x-tree-node-text ' and contains(.,'Chassis')]
Hope it helps :)
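The attribute test can be checked without a browser. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath (no contains()), so it uses the exact-match form of the predicate and filters on the text in Python. The trimmed markup and the second "Fans" node are hypothetical, added only so the filter has something to exclude.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical version of the markup in the question.
# Note the trailing space inside the class attribute value.
snippet = """<div class="x-grid-cell-inner x-grid-cell-inner-treecolumn">
  <span class="x-tree-node-text ">Chassis</span>
  <span class="x-tree-node-text ">Fans</span>
</div>"""

root = ET.fromstring(snippet)

# @class selects the attribute; the value must match exactly,
# including the trailing space.
spans = root.findall(".//span[@class='x-tree-node-text ']")

# Narrow down to the node whose text is "Chassis".
chassis = [s for s in spans if s.text == "Chassis"]
print(len(spans), len(chassis))
```

In Selenium the full XPath from the answers would be passed to the driver's XPath lookup instead; the point here is only that the attribute is written @class and that the exact-match form must include the trailing space.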

regex to remove recurring instances of comment tag

Hello, I want to remove every instance of the comment tag that occurs in my data.
The data I am using is shown below:
<!-- <li><a class="topitemlink" href="/About-Us/Career-Centre.aspx">Career Centre</a></li>
<li><img alt="" width="7" height="22" src="/images/common/separator.gif" /></li>-->
<li><a class="topitemlink" href="/ContactUs">Contact Us</a> <!-- <ul class="topcontactusmenu"><li>Contact Us</li><li>Contact the IR Team</li><li>Contact the Media Team</li></ul> --></li>
</ul>
</div>
<!--<img width="92" height="40" src="/ABMB/media/MyLibrary/Shared/Images/bizSmart_logo.gif" alt="" /><img width="76" height="40" src="/ABMB/media/MyLibrary/Shared/Images/sabah-run2015_top-icon.jpg" alt="" />-->
The regex I am using captures only the first instance, but I want all instances to be captured.
<!--.*\s.*-->
You could use something like this: <!--.+?-->. Make sure that you have the s and g flags enabled.
The s flag allows the period character to also match newlines, so that comments spanning multiple lines are captured.
The g flag applies the pattern globally, that is, to every match in the text rather than only the first.
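A minimal Python sketch of the same idea, assuming Python is acceptable for illustration: re.DOTALL plays the role of the s flag, and re.sub already replaces every match, so no separate g flag is needed. The shortened sample and the name strip_comments are made up for this example.

```python
import re

# DOTALL lets "." cross line breaks; the lazy +? stops at the first "-->".
COMMENT = re.compile(r"<!--.+?-->", re.DOTALL)

def strip_comments(html):
    """Remove every HTML comment, including ones spanning several lines."""
    return COMMENT.sub("", html)

# Shortened, hypothetical version of the data in the question.
sample = """<!-- <li>Career Centre</li>
<li>separator</li>-->
<li>Contact Us <!-- <ul>hidden</ul> --></li>"""

cleaned = strip_comments(sample)
print(cleaned)
```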
You didn't specify the language you're using, but for PHP you can use /<!--.*?-->/s, i.e.:
$html = '<!-- <li><a class="topitemlink" href="/About-Us/Career-Centre.aspx">Career Centre</a></li>
<li><img alt="" width="7" height="22" src="/images/common/separator.gif" /></li>-->
<li><a class="topitemlink" href="/ContactUs">Contact Us</a> <!-- <ul class="topcontactusmenu"><li>Contact Us</li><li>Contact the IR Team</li><li>Contact the Media Team</li></ul> --></li>
</ul>
</div>
<!--<img width="92" height="40" src="/ABMB/media/MyLibrary/Shared/Images/bizSmart_logo.gif" alt="" /><img width="76" height="40" src="/ABMB/media/MyLibrary/Shared/Images/sabah-run2015_top-icon.jpg" alt="" />-->';
$html = preg_replace('/<!--.*?-->/s', '', $html);
echo $html;
/*<li><a class="topitemlink" href="/ContactUs">Contact Us</a> </li>
</ul>
</div>*/
DEMO:
https://ideone.com/It6HvW
EXPLANATION:
<!--.*?-->
Options: Case sensitive; Exact spacing; Dot matches line breaks; ^$ don’t match at line breaks; Greedy quantifiers; Regex syntax only
Match the character string “<!--” literally «<!--»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character string “-->” literally «-->»

Regular expression for exactly one match

I am using the following regular expression in my code editor (Sublime Text) to search for ASP.NET comments.
<%--.*(\n.*)*--%>
I want this regular expression to stop as soon as the first --%> is found, but it keeps going until the last comment's --%>. I suspect I have to use some kind of flag to make it stop at the first --%>, but I am unable to figure it out.
Can anyone please tell me how I can modify this regex?
UPDATE
I forgot to post some sample markup. Here it is:
<div class="modal-footer">
<%--<button class="btn" data-dismiss="modal">
Close</button>
<button id="btnAddCountry" class="btn btn-primary" data-dismiss="modal">
Save changes</button>--%>
</div>
</div>
<div class="row-fluid">
<div class="span12">
<div class="box paint_hover">
<div class="title">
<h3>Sale Voucher</span>
</h3>
</div>
<div class="content">
<ul id="tabExample1" class="nav nav-tabs">
<li class="active"><a id="lnkAddEditVoucher" href="#AddEditVoucher" data-toggle="tab">Add/Update Sale Voucher</a></li>
<li><a id="lnkViewVouchers" href="#ViewVouchers" data-toggle="tab">Search Sale Voucher</a></li>
<%-- <li><a id="lnkViewParties" href="#ViewParties" data-toggle="tab">Search Parties</a></li>--%>
</ul>
I just want to match the first comment and not the second one.
You need to make the * quantifiers non-greedy. Usually this is done by adding a ? after them, e.g. .*? instead of just .*.
I've also simplified the regex a bit. Sublime Text supports the (?s) modifier at the beginning of the pattern to make the dot match even newlines:
(?s)<%--.*?--%>
If you prefer matching the newline explicitly:
<%--(.|\n)*?--%>
The problem you seem to have is that you use the greedy version of .*, which matches anything (including --%>). Try using <%--.*?(\n.*?)*?--%> instead to make it non-greedy.
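The difference between the greedy and lazy quantifier can be seen side by side in a short Python sketch. The markup is a shortened, hypothetical version of the sample above, and re.DOTALL stands in for the (?s) modifier.

```python
import re

# Shortened, hypothetical version of the markup: two separate comments.
markup = """<div>
<%--<button>
Close</button>--%>
</div>
<%-- <li>Search Parties</li>--%>
"""

# Greedy: .* runs to the end and backtracks to the LAST --%>.
greedy = re.search(r"<%--.*--%>", markup, re.DOTALL).group(0)

# Lazy: .*? expands one character at a time and stops at the FIRST --%>.
lazy = re.search(r"<%--.*?--%>", markup, re.DOTALL).group(0)

print(repr(lazy))
```

The greedy match swallows both comments and the markup between them, while the lazy match ends at the first closing delimiter.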

Regular expression to remove lines with special characters

<a class='jdr' href='javascript:void(0);' onClick="return openDiv('jrtp');"></a>
<span class="jcn">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET" title='Aptech N Power Hardware & Networking' >Aptech N Power Hardware & Networkin...</a>
</span>
<section class="jrat">
<a rel="nofollow" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw"><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s0'></span></a>
<a class="jrt" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw">2 ratings</a>
<span class="jrt"> |</span>
<a class="rate_this" onclick="_ct('ratethis','lspg');" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET/writereview">Rate this</a>
</section>
<section class="jcar">
<section class="jbc">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET">
<img width="83" height="56" border="0" src="http://images.jdmagicbox.com/upload_test/ahmedabad/b4/079pxx79.xx79.110420172948.d4b4/logo/faf3f2409ed7993aaa70f848ab0bb6fb_t.jpg" class="Clogo" />
</a>
<!-- <span class="noLogo"></span> -->
<section class="jrcl">
<p>
**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>
</p>
From the above XML data I want to extract the following:
A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024
I need help in creating a regular expression to find and remove all lines containing special characters.
I am using the following regex:
/(\<.+?>)/g
Please help. Thanks
Try this:
/(?<=\*{2})([^<>]*?)(?=\*{2})/g
It matches all content between the ** markers.
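As a quick check of the lookaround pattern, here is a Python sketch applied to the address line from the question. The variable names are made up for this example; the pattern itself is unchanged, and findall takes the place of the g flag.

```python
import re

# Lookbehind and lookahead require "**" on both sides without consuming it;
# [^<>]*? keeps the match from running across any tag.
BETWEEN_STARS = re.compile(r"(?<=\*{2})([^<>]*?)(?=\*{2})")

line = (
    "**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, "
    "Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>"
)

matches = BETWEEN_STARS.findall(line)
print(matches)
```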
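I think you want to remove the lines that consist of HTML tags, so try this (the m flag is needed so that ^ matches at the start of every line, not just the start of the input):
/^<.*>\n/gm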