I have the following table cell:
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
2.004
</td>
The cell contains spaces and line breaks too. The class="text-right" isn't unique on the page, but this cell is its first occurrence, in case that helps.
I want to match only the number (here 2.004, but it could be any other; there is always exactly one number), with or without a point and/or comma in it.
PS: Yes, I fully agree that parsing HTML with regex is not the best idea, but any other method would be so much overhead that it would not be worth doing :(
PPS: Please write your recommendations as answers, not as comments, so I can accept and reward them.
Solution: (?:<td\b.*?text-right\b.*?\D*?;">)([\s\S\d]*?)(?=\D*?<\/)
Edit: full length HTML:
<div class="box " >
<div class="box-head " >
<div class="box-icon">
<span class="icon "></span> </div>
<span class="divider"></span>
<div class="box-title box-title-space-1">
<span>Keyword-Profile</span></div>
<div class="box-options dropdown box-options-no-divider">
<div class="divider "></div>
<div class="box-icon "><a
class="button">
<span class="icon "></span> </a></div>
<ul class="dropdown-menu">
<li
> <a onclick="" class="modal"><div><div class="icon"><div></div></div><div class="text"> Add to Dashboard</div></div></a>
</li>
<li
><span class="box-menu-seperator"></span> <a onclick="
" href="" class="modal"><div><div class="icon"><div></div></div><div class="text"> Add to Report</div></div></a>
</li>
</ul>
</div>
</div>
<div class="module-loading-blocker">
<div class="module-loading-blocker-icon">
<div style="width: 40px; height: 40px; display: inline-block;">
<svg width="100%" height="100%" class="loading-circular" viewBox="0 0 50 50">
<circle class="loading-path" cx="25" cy="25" r="20" fill="none" stroke-width="5" stroke-miterlimit="10"/>
</svg>
</div> </div>
</div>
<div class="box-content box-body box-table" > <table class="table table-spaced">
<tr>
<td>
Top-10
</td>
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
2.004
</td>
</tr>
<tr>
<td>
Top-100
</td>
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
237.557
</td>
</tr>
<tr>
<td>
∅ Position
</td>
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
60
</td>
</tr>
</table>
</div></div><div class="module" style="display: none;">x</div>
Update (JavaScript RegExp)
To get the number within <td>
Setting aside the fact that the code will not function, here is a regex that gets the number in the first td.text-right only:
/(?:<td\b.*?text-right\b.*?\D*?)([0-9]+?[.,]*?[0-9]*?)(?=\D*?<\/)/
|1|]=-------------------------------------=[|2|]=-----------------------=[|3|]=------------=|]
Part 1, non-capturing group (?: ... ): the literal <td and a word boundary \b (between d and the whitespace); any characters, lazily, .*?; the literal text-right and a word boundary \b (between t and the whitespace); any characters, lazily, .*?; then zero or more non-digit characters, lazily, \D*?
Part 2, capturing group ( ... ): one or more digits, lazily, [0-9]+?; zero or more literal . or , characters, lazily, [.,]*?; zero or more digits, lazily, [0-9]*?
Part 3, positive lookahead (?= ... ): zero or more non-digit characters, lazily, \D*?, followed by the literal </ with the forward slash escaped, <\/
Better Regex
This one relies on the fact that each target is in the last column of its row, by adding <\/td>\s*?<\/tr> to a positive lookahead.
/\b([0-9]+?[.,]*?[0-9]*?)(?=\D*?<\/td>\s*?<\/tr>)/g;
It gives a cleaner result: the full match and the capture group are the same, and there is no non-capturing group as a side effect.
Demo
var rgx = /\b([0-9]+?[.,]*?[0-9]*?)(?=\D*?<\/td>\s*?<\/tr>)/g;
var str = document.documentElement.innerHTML;
let hits;
while ((hits = rgx.exec(str)) !== null) {
if (hits.index === rgx.lastIndex) {
rgx.lastIndex++;
}
hits.forEach(function(hit, idx) {
console.log(`Found match, group ${idx}: ${hit}`);
});
}
<div class="box ">
<div class="box-head ">
<div class="box-icon">
<span class="icon ">&f0ae;</span> </div>
<span class="divider"></span>
<div class="box-title box-title-space-1">
<span>Keyword-Profile</span></div>
<div class="box-options dropdown box-options-no-divider">
<div class="divider "></div>
<div class="box-icon ">
<a class="button">
<span class="icon ">&f013;</span> </a>
</div>
<ul class="dropdown-menu">
<li>
<a onclick="" class="modal">
<div>
<div class="icon">
<div>&f055;</div>
</div>
<div class="text"> Add to Dashboard</div>
</div>
</a>
</li>
<li><span class="box-menu-seperator"></span>
<a onclick="
" href="" class="modal">
<div>
<div class="icon">
<div>&f055;</div>
</div>
<div class="text"> Add to Report</div>
</div>
</a>
</li>
</ul>
</div>
</div>
<div class="module-loading-blocker">
<div class="module-loading-blocker-icon">
<div style="width: 40px; height: 40px; display: inline-block;">
<svg width="100%" height="100%" class="loading-circular" viewBox="0 0 50 50">
<circle class="loading-path" cx="25" cy="25" r="20" fill="none" stroke-width="5" stroke-miterlimit="10"/>
</svg>
</div>
</div>
</div>
<div class="box-content box-body box-table">
<table class="table table-spaced">
<tr>
<td>
Top-10
</td>
<td class="text-right" onmouseenter="\$(this).find('.overlay-viewable-box:first').show();" onmouseleave="\$(this).find('.overlay-viewable-box:first').hide();">
2.004
</td>
</tr>
<tr>
<td>
Top-100
</td>
<td class="text-right" onmouseenter="\$(this).find('.overlay-viewable-box:first').show();" onmouseleave="\$(this).find('.overlay-viewable-box:first').hide();">
237.557
</td>
</tr>
<tr>
<td>
∅ Position
</td>
<td class="text-right" onmouseenter="\$(this).find('.overlay-viewable-box:first').show();" onmouseleave="\$(this).find('.overlay-viewable-box:first').hide();">
60
</td>
</tr>
</table>
</div>
</div>
<div class="module" style="display: none;">x</div>
A simple solution, provided that your parsing engine can search across lines, and supports lookarounds:
(?<=>\s*)([0-9]+(?:\.[0-9]+)?)(?=\s*<)
Explained:
The first part is (?<=>\s*). (?<=regex) is called a positive lookbehind, which tells the parser to check whether a pattern matching regex exists before the actual matching part. In this case it looks for a > followed by any amount of whitespace.
The core part, [0-9]+(?:\.[0-9]+)?, matches one or more digits, optionally followed by a dot and another group of one or more digits. The last ? indicates that the decimal part is optional.
The last part is (?=\s*<). (?=regex) is called a positive lookahead, which tells the parser to check whether a pattern matching regex exists after the actual matching part. In this case it looks for any amount of whitespace, followed by a <.
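Not every engine accepts the variable-length lookbehind used above. Python's built-in re module, for example, requires lookbehinds to have a fixed width and rejects (?<=>\s*). A minimal sketch of the same idea for such engines, assuming it is acceptable to read the number from a capture group instead of the full match (the names SAMPLE, NUMBER_CELL and cell_numbers are made up for this illustration):

```python
import re

# Trimmed copy of the table rows from the question.
SAMPLE = """
<tr>
<td>
Top-10
</td>
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
2.004
</td>
</tr>
<tr>
<td>
Top-100
</td>
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
237.557
</td>
</tr>
<tr>
<td>
∅ Position
</td>
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
60
</td>
</tr>
"""

# A '>' and optional whitespace, the number (captured), optional whitespace and a '<'.
# The surrounding context is consumed instead of asserted, so no lookbehind is needed.
NUMBER_CELL = re.compile(r">\s*([0-9]+(?:\.[0-9]+)?)\s*<")


def cell_numbers(html):
    """Return every number that is the sole text between a '>' and the next '<'."""
    return NUMBER_CELL.findall(html)


print(cell_numbers(SAMPLE))  # ['2.004', '237.557', '60']
```

Because \s also matches newlines, no special flag is needed to search across lines.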
Assuming your regex engine understands pcre, try
/>[\s]*([[:digit:]]+(\.[[:digit:]]+)?)[\s]*<\//g
to match a number optionally surrounded by whitespace ( including newline/linefeed characters ) which is the sole textual content of a html element. Capture group 1 holds the number.
You may need to adjust the pattern inside the capture group to cater for the kinds of lexicalisations you'd consider a 'number'.
Drop the start and the end of the expression ( ie. >, <\/ ) if the assumed structural html context is too restrictive for your purposes. Given your question you are aware that doing so increases the risk of false positives.
See it live at Regex101
Btw, there are HTML parser libraries for most programming languages that tolerate syntax errors and offer simple interfaces for iterating over all textual content. Just for the sake of argument, if jQuery or some similar functionality is available, you may proceed along the lines of this SO answer; just replace the inner return expression with a regex test, like this (untested code):
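// JavaScript has no POSIX classes such as [[:digit:]], so \d is used instead.
// No 'g' flag: test() on a global regex keeps state in lastIndex between calls.
var re = /\d+(\.\d+)?/;
$.fn.findByREText = function (re) {
// Return the child nodes whose trimmed text passes the regex test.
return $('*').contents().filter(function () {
return re.test($(this).text().trim());
});
};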
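The parser-based route can be sketched without jQuery as well. The following is a minimal illustration in Python using only the standard library's html.parser, not the approach of the linked SO answer; the class name RightCellNumbers and the NUMBER pattern are assumptions made for this example. It collects the text of every td whose class list contains text-right and keeps it when the whole trimmed text is a number:

```python
import re
from html.parser import HTMLParser

# Digits, optionally followed by more digit groups separated by '.' or ','.
NUMBER = re.compile(r"[0-9]+(?:[.,][0-9]+)*")

# Trimmed copy of the table rows from the question.
SAMPLE = """
<table class="table table-spaced">
<tr>
<td>
Top-10
</td>
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
2.004
</td>
</tr>
<tr>
<td>
Top-100
</td>
<td class="text-right"
onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">
237.557
</td>
</tr>
</table>
"""


class RightCellNumbers(HTMLParser):
    """Collect numbers that are the sole text of <td class="text-right"> cells."""

    def __init__(self):
        super().__init__()
        self.numbers = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            self._in_cell = "text-right" in classes

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a matching cell is considered.
        if self._in_cell and NUMBER.fullmatch(data.strip()):
            self.numbers.append(data.strip())


parser = RightCellNumbers()
parser.feed(SAMPLE)
print(parser.numbers)  # ['2.004', '237.557']
```

Taking only the first entry of parser.numbers would give the single value asked for in the question.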
Related
I'm trying to grab the value from the lights node, based on a house number set in a parameter. The problem is, based on certain conditions, houses may be in different row positions.
If the parameter being sent to me for the house number is House237, then how do I get the number of lights located within the row-2-Lights node?
Also, how do I do the same if the next run, the house number is House867? Below is my HTML:
<?xml version='1.0' encoding='utf-8'?>
<table id="neighborhood">
<tr onmouseover="leave('1')">
<td id="row-1-house">
<div class="houseCol">
<a href="#" onClick="goHome('867');return false">
House867
</a>
</div>
</td>
<td id="row-1-Lights">
<div class="decimal">14</div>
</td>
</tr>
<tr onmouseover="leave('2')">
<td id="row-2-house">
<div class="houseCol">
<a href="#" onClick="goHome('237');return false">
House237
</a>
</div>
</td>
<td id="row-2-Lights">
<div class="decimal">12</div>
</td>
</tr>
</table>
You can try the following XPath-1.0 expression. The parameter is the 'HouseXXX' string, the child of the a element.
/table[@id='neighborhood']/tr[td/div[@class='houseCol']/a[normalize-space(text())='House237']]/td[contains(@id,'Lights')]/div[@class='decimal']/text()
The output of this is
12
In this example the parameter is set to 'House237'. How you incorporate the parameter into the XPath expression depends on your use case.
For example, in XSLT you would replace 'House237' with a variable like $HouseNumber to set the parameter.
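Outside XSLT, the house number can be passed as an ordinary function argument. Below is a rough sketch in Python with the standard library's xml.etree.ElementTree. Its XPath subset has no normalize-space() or contains(), so those two checks are done in plain Python here; this is a swapped-in technique, not the expression above. The function name lights_for and the trimmed SAMPLE are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question (event handlers omitted).
SAMPLE = """<table id="neighborhood">
  <tr>
    <td id="row-1-house">
      <div class="houseCol">
        <a href="#">
          House867
        </a>
      </div>
    </td>
    <td id="row-1-Lights">
      <div class="decimal">14</div>
    </td>
  </tr>
  <tr>
    <td id="row-2-house">
      <div class="houseCol">
        <a href="#">
          House237
        </a>
      </div>
    </td>
    <td id="row-2-Lights">
      <div class="decimal">12</div>
    </td>
  </tr>
</table>"""


def lights_for(table, house):
    """Return the lights value from the row whose link text equals `house`."""
    for row in table.findall("tr"):
        link = row.find("td/div[@class='houseCol']/a")
        # Stand-in for normalize-space(text()) = house.
        if link is None or (link.text or "").strip() != house:
            continue
        for cell in row.findall("td"):
            # Stand-in for contains(@id, 'Lights').
            if "Lights" in cell.get("id", ""):
                value = cell.find("div[@class='decimal']")
                if value is not None:
                    return (value.text or "").strip()
    return None


root = ET.fromstring(SAMPLE)
print(lights_for(root, "House237"))  # 12
print(lights_for(root, "House867"))  # 14
```

Because the row is located by its link text rather than by position, the lookup keeps working when the houses change rows between runs.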
<div id="eventInfoContainer">
<table>
<tbody><tr>
<td class="verticalTop">
<script type="text/javascript"><!--
google_ad_client = "ca-pub-2475575566915822";
/* listing page */
google_ad_slot = "4647770957";
google_ad_width = 160;
google_ad_height = 600;
//-->
</script>
<script type="text/javascript" src="https://pagead2.googlesyndication.com/pagead/show_ads.js">
</script><ins id="aswift_0_expand" style="display:inline-table;border:none;height:600px;margin:0;padding:0;position:relative;visibility:visible;width:160px;background-color:transparent;"><ins id="aswift_0_anchor" style="display:block;border:none;height:600px;margin:0;padding:0;position:relative;visibility:visible;width:160px;background-color:transparent;"><iframe width="160" height="600" frameborder="0" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" scrolling="no" allowfullscreen="true" onload="var i=this.id,s=window.google_iframe_oncopy,H=s&&s.handlers,h=H&&H[i],w=this.contentWindow,d;try{d=w.document}catch(e){}if(h&&d&&(!d.body||!d.body.firstChild)){if(h.call){setTimeout(h,0)}else if(h.match){try{h=s.upd(h,i)}catch(e){}w.location.replace(h)}}" id="aswift_0" name="aswift_0" style="left:0;position:absolute;top:0;width:160px;height:600px;"></iframe></ins></ins>
</td>
<td class="spacer30w"></td>
<td class="verticalTop">
<span id="eventNameHeader">The Future of Medicine, Health Care and Biological Studies</span>
<br>
<br>
<span id="smallerHeading">Conference</span>
<br>
<br>
<span id="eventDate">16th to 17th October 2017</span>
<br>
<span id="eventCountry">Rockville, Maryland, United States of America</span>
<br>
<br>
<span id="eventWebsite">
<span id="smallerHeading">Website: </span>
http://rais.education/the-future-of-medicine-health-care-and-biological-studies/
</span>
<br>
<span id="eventContactPerson"><span id="smallerHeading">Contact person: </span>Eduard David</span>
<br>
<br>
<span id="eventDescription">We gladly invite you to attend the International Conference The Future of Medicine, Health Care and Biological Studies which will be held at Johns Hopkins University, just 20 miles away from Washington DC. </span>
<br>
<br>
<span id="eventOrganiser"><span style="font-weight: bold; color: #696969;">Organized by: </span>Research Association for Interdisciplinary Studies (RAIS)</span> <br><span id="eventDeadline"><span style="font-weight: bold; color: #696969;">Deadline for abstracts/proposals: </span>21st August 2017</span> <br>
<br>
Check the event website for more details.
<br>
<br>
<br>
<br>
<br>
<br>
<table>
<tbody><tr>
<td class="verticalMiddle">
<form><input type="button" value="Back" onclick="history.go(-1); return true;"></form>
</td>
<td class="spacer15w"></td>
<td class="verticalMiddle">
<a title="Share this conference on Facebook" href="http://www.facebook.com/sharer.php?
s=100
&p[url]=http://www.conferencealerts.com/show-event?id=187457 &p[title]=The Future of Medicine, Health Care and Biological Studies &p[summary]=We gladly invite you to attend the International Conference The Future of Medicine, Health Care and Biological Studies which will be held at Johns Hopkins University, just 20 miles away from Washington DC. " target="_blank" class="fb_share_link">Share on Facebook</a>
</td>
<td class="spacer15w"></td>
<td>
<img src="http://www.google.com/calendar/images/ext/gc_button6.gif" border="0" align="left">
</td>
</tr>
<tr><td class="spacer5"></td></tr>
<tr>
<td colspan="5">
<script type="text/javascript"><!--
google_ad_client = "ca-pub-2475575566915822";
/* show event under content */
google_ad_slot = "8943315143";
google_ad_width = 300;
google_ad_height = 250;
//-->
</script>
<script type="text/javascript" src="https://pagead2.googlesyndication.com/pagead/show_ads.js">
</script><ins id="aswift_1_expand" style="display:inline-table;border:none;height:250px;margin:0;padding:0;position:relative;visibility:visible;width:300px;background-color:transparent;"><ins id="aswift_1_anchor" style="display:block;border:none;height:250px;margin:0;padding:0;position:relative;visibility:visible;width:300px;background-color:transparent;"><iframe width="300" height="250" frameborder="0" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" scrolling="no" allowfullscreen="true" onload="var i=this.id,s=window.google_iframe_oncopy,H=s&&s.handlers,h=H&&H[i],w=this.contentWindow,d;try{d=w.document}catch(e){}if(h&&d&&(!d.body||!d.body.firstChild)){if(h.call){setTimeout(h,0)}else if(h.match){try{h=s.upd(h,i)}catch(e){}w.location.replace(h)}}" id="aswift_1" name="aswift_1" style="left:0;position:absolute;top:0;width:300px;height:250px;"></iframe></ins></ins>
</td>
</tr>
</tbody></table>
<br>
</td>
</tr>
</tbody></table>
</div>
How do I get the text "The Future of Medicine, Health Care and Biological Studies" from the above code in Python using Scrapy?
I tried this code:
response.css('div.eventInfoContainer table tbody tr td:nth-child(3) span::text').extract()
But the output I get is an empty list: "[]"
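Note that eventInfoContainer is an id in the markup, not a class, so the class selector div.eventInfoContainer matches nothing. As the span element that contains the required information has an id attribute (which should be unique), this should suffice: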
text = response.css('span#eventNameHeader::text').extract_first()
EDIT:
Using XPath, it's similar:
text = response.xpath('//span[@id="eventNameHeader"]/text()').extract_first()
I need some help: I want to replace the href links with my own links, but only within a particular div class.
<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb">
<b class="icon-star"></b> N/A
</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>
</div>
Here I want to change http://oldsite.com/ to http://newsite.com/?id=
so that the href links look like this:
<a href="http://newsite.com/?id=the-fate-of-the-furious">
Please help me with a preg_replace regular expression.
Thanks
This may help you:
$content = get_the_content();
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
$newurl = get_permalink();
$content = preg_replace($pattern,$newurl,$content);
echo $content;
Lookbehinds are comparatively expensive; use \K to restart the full-string match, which also avoids a capture group.
The pattern <a href="\K[^"]+\/ is very efficient. I should state that it matches ALL <a href URLs. It also matches greedily until it finds the last / in the URL, which I assume is okay given your input sample.
Code (PHP Demo):
$in='<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb"><b class="icon-star"></b> N/A</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>';
echo preg_replace('/<a href="\K[^"]+\//','http://newsite.com/?id=',$in);
Output:
<div id="slider1" class="owl-carousel owl-theme">
<div class="item">
<div class="imagens">
<img src="https://image.oldste.org" alt="The Fate of the Furious" width="100%" height="100%" />
<span class="imdb"><b class="icon-star"></b> N/A</span>
</div>
<span class="ttps">The Fate of the Furious</span>
<span class="ytps">2017</span>
</div>
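For engines without \K (Python's built-in re module is one), the same replacement can be written with a capture group that is put back in front of the new URL. This is only a sketch: the anchor in SAMPLE is hypothetical, since the link itself is missing from the sample above and its shape is assumed from the URLs described in the question, and rewrite_links is a made-up name.

```python
import re

# Hypothetical input: an assumed anchor pointing at the old site.
SAMPLE = (
    '<div class="item">'
    '<a href="http://oldsite.com/the-fate-of-the-furious">'
    'The Fate of the Furious</a>'
    '</div>'
)


def rewrite_links(html, new_prefix="http://newsite.com/?id="):
    """Replace everything up to the last '/' of each <a href> URL with new_prefix."""
    # Group 1 keeps the literal '<a href="'. [^"]+/ is greedy, so it consumes
    # the old URL up to its last slash, exactly like the \K pattern.
    return re.sub(
        r'(<a href=")[^"]+/',
        lambda match: match.group(1) + new_prefix,
        html,
    )


print(rewrite_links(SAMPLE))
```

Like the \K pattern, this rewrites every <a href in the input, so it should be applied only to the fragment that holds the target div.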
I want to use regular expression to match the following html table:
<tbody class=\"DocTableBody \">
<tr data-fastRow=\"1\" class=\"DataRow TDRE\">
<td id=\"g-f-1\" class=\"TDC FieldDisabled Field TCLeft CellText g-f\" >
<div class=\"DTC\">
<label id=\"c_g-f-1\" class=\"DCC\" >01-Apr-2015</label>
</div>
</td>
<td id=\"g-g-1\" class=\"TDC FieldDisabled Field TCLeft CellTextHtml g-g\" >
<div class=\"DTC\">
<label id=\"c_g-g-1\" class=\"DCC\" >ACTIVE</label>
</div>
</td>
</tr>
<tr data-fastRow=\"2\" class=\"DataRow TDRO\">
<td id=\"g-f-2\" class=\"TDC FieldDisabled Field TCLeft CellText g-f\" >
<div class=\"DTC\">
<label id=\"c_g-f-2\" class=\"DCC\" >01-Apr-2015</label>
</div>
</td>
<td id=\"g-g-2\" class=\"TDC FieldDisabled Field TCLeft CellTextHtml g-g\" >
<div class=\"DTC\">
<label id=\"c_g-g-2\" class=\"DCC\" >ACTIVE</label>
</div>
</td>
</tr>
</tbody>
I expect to extract the following values:
"1"
01-Apr-2015
ACTIVE
"2"
01-Apr-2015
ACTIVE
I tried the following to extract the value in data-fastRow:
(?sUi)<tr data-fastRow=\\"(\d+)\\".+>.*<\/tr>
But I couldn't extract the nested items in <label.+>(.*)</label> in a single regular expression.
Is it possible to extract parent and nested items in a single regular expression?
It's a really bad idea to parse HTML with regular expressions.
Each language has its own libraries for parsing HTML.
In Python, for example, you have BeautifulSoup.
It is far better to use such libraries.
Usually, such libraries have a jQuery-selector-like interface (or something similar), which lets you find your data with very simple queries.
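As a rough illustration of the parser route for the table in the question, here is a sketch that uses Python's standard-library html.parser in place of BeautifulSoup, so that it runs without third-party packages. The class name RowLabelParser is made up for this example, and the sample assumes the HTML has already been unescaped (plain " instead of \"):

```python
from html.parser import HTMLParser

# The table from the question with the quote escaping removed.
SAMPLE = '''<tbody class="DocTableBody ">
<tr data-fastRow="1" class="DataRow TDRE">
<td id="g-f-1" class="TDC FieldDisabled Field TCLeft CellText g-f" >
<div class="DTC">
<label id="c_g-f-1" class="DCC" >01-Apr-2015</label>
</div>
</td>
<td id="g-g-1" class="TDC FieldDisabled Field TCLeft CellTextHtml g-g" >
<div class="DTC">
<label id="c_g-g-1" class="DCC" >ACTIVE</label>
</div>
</td>
</tr>
<tr data-fastRow="2" class="DataRow TDRO">
<td id="g-f-2" class="TDC FieldDisabled Field TCLeft CellText g-f" >
<div class="DTC">
<label id="c_g-f-2" class="DCC" >01-Apr-2015</label>
</div>
</td>
<td id="g-g-2" class="TDC FieldDisabled Field TCLeft CellTextHtml g-g" >
<div class="DTC">
<label id="c_g-g-2" class="DCC" >ACTIVE</label>
</div>
</td>
</tr>
</tbody>'''


class RowLabelParser(HTMLParser):
    """Collect (data-fastRow value, [label texts]) for every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_label = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # HTMLParser lowercases attribute names, hence 'data-fastrow'.
        if tag == "tr" and "data-fastrow" in attrs:
            self.rows.append((attrs["data-fastrow"], []))
        elif tag == "label" and self.rows:
            self._in_label = True

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False

    def handle_data(self, data):
        # Attach label text to the most recently opened row.
        if self._in_label and data.strip():
            self.rows[-1][1].append(data.strip())


parser = RowLabelParser()
parser.feed(SAMPLE)
print(parser.rows)
# [('1', ['01-Apr-2015', 'ACTIVE']), ('2', ['01-Apr-2015', 'ACTIVE'])]
```

The parent value and the nested label texts come out together per row, which is the part that is awkward to express in a single regular expression.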
I have the following HTML:
<div>
<table>
<tr>
<td>
<div class="w135">
<div style="float: left; padding-right: 10px;" class="imageThumbnail playerDiv">
<a href="/sport/tennis/2014/10/djokovic-through-wozniacki-out-china-open-2014101114115427766.html" id="ctl00_ctl00_DataList1_ctl00_Thumbnail1_lnkImage10" target="_parent">
<img src="/mritems/imagecache/89/135/mritems/images/2014/10/1/2014101114447491734_20.jpg" id="ctl00_ctl00_DataList1_ctl00_Thumbnail1_imgSmall10" border="0" class="imageThumbnail">
</a>
</div>
</div>
</td>
</tr>
</table>
</div>
When I run the rake task, I get the error:
NoMethodError: undefined method `at_css' for ["id","ctl00_cphBody_ctl01_DataList1_ctl00_Thumbnail1_Layout17"]:Array
This is the code:
@request = HTTParty.get(url)
@html = Nokogiri::HTML(@request.body)
@html.css(".w135")[0].map do |item|
url = item.at_css("div.playerDiv a")
puts url.inspect
end
I'm really not sure what the issue is and have been trying to fix this for a while. The error occurs on this line: url = item.at_css("div.playerDiv a")
Any suggestion is appreciated!
Thanks
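Your code calls map on a single node (because of the [0]), and iterating over a Nokogiri node yields its attributes as [name, value] arrays, which is why at_css is undefined for an Array. I'd do it using something like: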
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div>
<table>
<tr>
<td>
<div class="w135">
<div style="float: left; padding-right: 10px;" class="imageThumbnail playerDiv">
<a href="/sport/tennis/2014/10/djokovic-through-wozniacki-out-china-open-2014101114115427766.html" id="ctl00_ctl00_DataList1_ctl00_Thumbnail1_lnkImage10" target="_parent">
<img src="/mritems/imagecache/89/135/mritems/images/2014/10/1/2014101114447491734_20.jpg" id="ctl00_ctl00_DataList1_ctl00_Thumbnail1_imgSmall10" border="0" class="imageThumbnail">
</a>
</div>
</div>
</td>
</tr>
</table>
</div>
EOT
puts doc.search('.w135 div.playerDiv a').map(&:inspect)
Which outputs:
# >> #<Nokogiri::XML::Element:0x3ff0918b132c name="a" attributes=[#<Nokogiri::XML::Attr:0x3ff0918b1250 name="href" value="/sport/tennis/2014/10/djokovic-through-wozniacki-out-china-open-2014101114115427766.html">, #<Nokogiri::XML::Attr:0x3ff0918b123c name="id" value="ctl00_ctl00_DataList1_ctl00_Thumbnail1_lnkImage10">, #<Nokogiri::XML::Attr:0x3ff0918b1228 name="target" value="_parent">] children=[#<Nokogiri::XML::Text:0x3ff0918a5b6c "\n ">, #<Nokogiri::XML::Element:0x3ff0918a5360 name="img" attributes=[#<Nokogiri::XML::Attr:0x3ff0918a4d20 name="src" value="/mritems/imagecache/89/135/mritems/images/2014/10/1/2014101114447491734_20.jpg">, #<Nokogiri::XML::Attr:0x3ff0918a4cbc name="id" value="ctl00_ctl00_DataList1_ctl00_Thumbnail1_imgSmall10">, #<Nokogiri::XML::Attr:0x3ff0918a4b90 name="border" value="0">, #<Nokogiri::XML::Attr:0x3ff0918a4a28 name="class" value="imageThumbnail">]>, #<Nokogiri::XML::Text:0x3ff091871920 "\n ">]>
If you're trying to access the "href" parameter, instead of using inspect, use:
puts doc.search('.w135 div.playerDiv a').map{ |n| n['href'] }
# >> /sport/tennis/2014/10/djokovic-through-wozniacki-out-china-open-2014101114115427766.html