Using the contains() function in XPath 1.0 to select numbers - regex

I'm using Scrapy and I need to scrape something like this: any number, followed by a dash, followed by any number, then a whitespace, then two letters (e.g. 1-3 mm). It seems XPath 1.0 does not allow the use of regex. Searching around, I've found some workarounds like using starts-with() and ends-with(), but from what I've seen they are only used with letters. Please help.

Scrapy uses lxml internally, and lxml's XPath has support for regular expressions via EXSLT when you add the corresponding namespaces.
Scrapy does that by default, so you can use re:test() within XPath expressions as a boolean in predicates.
boolean re:test(string, string, string?)
The re:test function returns true if the string given as the first argument matches the regular expression given as the second argument.
See this example Python 2 session:
>>> import scrapy
>>> t = u"""<!DOCTYPE html>
... <html lang="en">
... <body>
... <p>ab-34mm</p>
... <p>102-d mm</p>
... <p>15-22 µm</p>
... <p>1-3 nm</p>
... </body>
... </html>"""
>>> selector = scrapy.Selector(text=t)
>>> selector.xpath(r'//p/text()[re:test(., "\d+-\d+\s\w{2}")]').extract()
[u'15-22 \xb5m', u'1-3 nm']
>>>
Edit: note on using EXSLT re:match
Using EXSLT re:match is a bit trickier, or at least less natural, than re:test. re:match is similar to Python's re.match, which returns a MatchObject.
The signature is different from re:test:
object regexp:match(string, string, string?)
The regexp:match function returns a node set of match elements
So re:match will return <match> elements. To capture the string from these <match> elements, you need to use the function as the "outer" function, not inside predicates.
The following example chains XPath expressions,
selecting <p> paragraphs
then matching each paragraph string-value (normalized) with a regular expression containing parenthesized groups
finally extracting the result of these re:match calls
Python 2 shell:
>>> for p in selector.xpath('//p'):
... print(p.xpath(ur're:match(normalize-space(.), "(\d+)-(\d+)\s(\w{2})")').extract())
...
[]
[]
[u'<match>15-22 \xb5m</match>', u'<match>15</match>', u'<match>22</match>', u'<match>\xb5m</match>']
[u'<match>1-3 nm</match>', u'<match>1</match>', u'<match>3</match>', u'<match>nm</match>']
>>>

To do this with XPath 1.0 you can use the translate function.
translate(@test, '1234567890', '..........') will replace any number (digit) with a dot.
If your numbers are always one digit you may try something like:
[translate(@test, '1234567890', '..........') = '.-. mm']
If the numbers could be longer than one digit, you may try to replace the digits with nothing and test for - mm:
[translate(@test, '1234567890', '') = '- mm']
But this can produce false positives. To avoid them you will need to check, with substring-before/substring-after and string-length, that there was at least one digit.
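A minimal sketch of the single-digit variant, using lxml directly (the <root>/<p> markup and the sample strings here are made up for illustration):
from lxml import etree

doc = etree.fromstring(u'<root><p>1-3 mm</p><p>a-b mm</p></root>')
# every digit becomes a dot, so '1-3 mm' becomes '.-. mm' while 'a-b mm' is unchanged
print(doc.xpath(u"//p[translate(., '1234567890', '..........') = '.-. mm']/text()"))
# ['1-3 mm']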

Related

How find XPATH with random number value in attribute?

I have div blocks on website like this: <div id="banner-XXX-1"></div>
So I need to query this banner, where XXX is any digit number.
How to do that? Currently I use this way:
//div[contains(@id,'banner-') and contains(@id,'-1')]
But this way is not good if XXX starts with 1. So, is there any way to do like this: //div[contains(@id,'banner-' + <any_decimal> + '-1')]?
It seems the matches operator does not work in the popular Chrome plugin XPath Helper, so I use XPath 1.0.
https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en
XPath 1.0
This XPath 1.0 expression,
//div[ starts-with(@id, 'banner-')
       and translate(substring(@id, 8, 3), '0123456789', '') = ''
       and substring(@id, 11) = '-1']
selects all div elements whose id attribute value
starts with banner-,
followed by 3 digits, which the translate() trick maps to nothing,
followed by -1,
as requested.
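For reference, this is how the expression behaves when evaluated with lxml (a minimal sketch; the ids are made up):
from lxml import etree

doc = etree.fromstring(
    '<root>'
    '<div id="banner-123-1"/>'
    '<div id="banner-1234-1"/>'
    '<div id="other-123-1"/>'
    '</root>')
expr = ("//div[starts-with(@id, 'banner-')"
        " and translate(substring(@id, 8, 3), '0123456789', '') = ''"
        " and substring(@id, 11) = '-1']")
# substring(@id, 8, 3) is the XXX part; translate() deletes digits,
# so the comparison with '' holds only when XXX is all digits
print([d.get('id') for d in doc.xpath(expr)])
# ['banner-123-1']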
XPath 2.0
This XPath 2.0 expression,
//div[matches(@id,'^banner-\d{3}-1$')]
selects all div elements whose id attribute value matches the shown regex and
starts (^) with banner-,
followed by 3 digits, (\d{3}),
and ends ($) with -1,
as requested.
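Neither lxml nor the standard library's ElementTree implements XPath 2.0, so matches() is not available there. One way to evaluate it from Python is the third-party elementpath package; a minimal sketch, assuming its select() API:
import xml.etree.ElementTree as ET
import elementpath  # pip install elementpath

root = ET.fromstring(
    '<root>'
    '<div id="banner-123-1"/>'
    '<div id="banner-12-1"/>'
    '</root>')
# matches() needs an XPath 2.0 engine; elementpath defaults to one
divs = elementpath.select(root, r'//div[matches(@id, "^banner-\d{3}-1$")]')
print([d.get('id') for d in divs])
# expected: ['banner-123-1']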

Is there a regular expression for finding all question sentences from a webpage?

I am trying to extract some questions from a web site using BeautifulSoup, and want to use regular expression to get these questions from the web. Is my regular expression incorrect? And how can I combine soup.find_all with re.compile?
I have tried the following:
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
import urllib
import re
url = "https://www.sanfoundry.com/python-questions-answers-variable-names/"
headers = {'User-Agent':'Mozilla/5.0'}
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")
a = soup.find_all("p")
for m in a:
    print(m.get_text())
Now I have some text containing the questions like "1. Is Python case sensitive when dealing with identifiers?". I want to use r"[^.!?]+\?" to filter out the unwanted text, but I have the following error:
a = soup.find_all("p" : re.compile(r'[^.!?]+\?'))
                      ^
SyntaxError: invalid syntax
I checked my regular expression on https://regex101.com, it seems right. Is there a way to combine the regular expression and soup.find_all together?
One of the methods to find p elements containing a ? is to
define a criterion function:
def criterion(tag):
    return tag.name == 'p' and re.search(r'\?', tag.text)
and use it in find_all:
pars = soup.find_all(criterion)
But you want to print only the questions, not the whole paragraphs from pars.
To match these questions, define a pattern:
pat = re.compile(r'\d+\.\s[^?]+\?')
(a sequence of digits, a dot, a space, then a sequence of chars other than ? and finally a ?).
Note that in the general case one paragraph may contain multiple questions, so the loop processing the paragraphs found should:
use findall to find all questions in the current paragraph (the result is a list of found strings),
print all of them on separate lines, using join with '\n' as a separator.
So the whole loop should be:
for m in pars:
    questions = pat.findall(m.get_text())
    print('\n'.join(questions))
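Putting the two pieces together, a complete sketch of this approach (same URL and headers as in the question):
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.sanfoundry.com/python-questions-answers-variable-names/"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, "lxml")

def criterion(tag):
    # keep only <p> tags whose text contains a question mark
    return tag.name == 'p' and re.search(r'\?', tag.text)

pat = re.compile(r'\d+\.\s[^?]+\?')
for p in soup.find_all(criterion):
    questions = pat.findall(p.get_text())
    if questions:
        print('\n'.join(questions))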
Not a big regex fan, so tried this:
for q in a:
    for i in q:
        if '?' in i:
            print(i)
Output:
1. Is Python case sensitive when dealing with identifiers?
2. What is the maximum possible length of an identifier?
3. Which of the following is invalid?
4. Which of the following is an invalid variable?
5. Why are local variable names beginning with an underscore discouraged?
6. Which of the following is not a keyword?
8. Which of the following is true for variable names in Python?
9. Which of the following is an invalid statement?
10. Which of the following cannot be a variable?

Scala regex find/replace with additional formatting

I'm trying to replace parts of a string that contains what should be dates, but which are possibly in an impermissible format. Specifically, all of the dates are in the form "mm/dd/YYYY" and they need to be in the form "YYYY-mm-dd". One caveat is that the original dates may not exactly be in the mm/dd/YYYY format; some are like "5/6/2015". For example, if
val x = "where date >= '05/06/2017'"
then
x.replaceAll("'([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})'", "'$3-$1-$2'")
performs the desired replacement (the date becomes "2017-05-06"), but for
val y = "where date >= '5/6/2017'"
this does not return the desired replacement (the date becomes "2017-5-6" -- for me, an invalid representation). With the Joda-Time wrapper nscala-time, I've tried capturing the dates and then reformatting them:
import com.github.nscala_time.time.Imports._
import org.joda.time.DateTime
val f = DateTimeFormat.forPattern("yyyy-MM-dd")
y.replaceAll("'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'",
"'"+f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime("$1"))+"'")
But this fails with a java.lang.IllegalArgumentException: Invalid format: "$1". I've also tried using the f interpolator and padding with 0s, but it doesn't seem to like that either.
Are you not able to do additional processing on the captured groups ($1, etc.) inside the replaceAll? If not, how else can I achieve the desired result?
Backreferences like $1 can only be used inside string replacement patterns. In your code, "$1" is not a backreference any longer: the f.print(...) call runs before replaceAll ever sees the string, so parseDateTime receives the literal text "$1".
You may use a "callback" with replaceAllIn to actually get the match object and access its groups to further manipulate them:
val pattern = "'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'".r
val result = pattern replaceAllIn (y, m =>
  "'" + f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(m.group(1))) + "'")
Regex.replaceAllIn is overloaded and can take a Match => String.
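For comparison, Python's re.sub accepts a callable replacement in exactly the same way, which makes it easy to prototype the transformation; a sketch using only the standard library (datetime stands in for Joda-Time here):
import re
from datetime import datetime

y = "where date >= '5/6/2017'"
pattern = re.compile(r"'(\d{1,2}/\d{1,2}/\d{4})'")

def reformat(m):
    # parse the possibly unpadded mm/dd/yyyy date and re-emit it as yyyy-mm-dd
    return "'" + datetime.strptime(m.group(1), "%m/%d/%Y").strftime("%Y-%m-%d") + "'"

print(pattern.sub(reformat, y))
# where date >= '2017-05-06'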

scrapy/xpaths/regex: proper xpath/re to ignore "link interjections"

I am scraping some Korean-language text and I come across a lot of "link interjections", for lack of a better word, where the HTML looks like this...
<a href="...">저</a>는 좋아요
It shows '저' as a hyperlink but '는 좋아요' as regular text. In reality they are part of the same sentence and display on the page as '저는 좋아요', but when scraping using this xpath and regex...
foo = response.xpath('//*[@id="divID"]/p//text()').re(ur'[\uac00-\ud7af]+')
it is broken into two words in a list...
foo == ['저', '는', '좋아요']
How can I get this to keep it as one word, as was my original intent?
intended: foo == ['저는', '좋아요']
EDIT: (comment response)
the problem with .join() is that, as far as I can tell, it will join all the regularly scraped words as well. So I would end up with this...
''.join(foo) == '저는좋아요'
So I do not think that .join() will work, unless there is something I am missing.
If you want to work on the string representation of an HTML element, XPath has a string() function that can be very helpful.
Once you have a single string for the element, you can apply regular expressions for words.
Here's a sample python interpreter session (I had to change your markup a bit to match the results you showed):
>>> import scrapy
>>>
>>> response = scrapy.Selector(text=u'<p><a href="#">저</a>는 좋아요</p>')
.//text() will select all descendant text nodes, as individual strings when .extract()ed (2 strings in this case):
>>> response.xpath('.//p//text()').extract()
[u'\uc800', u'\ub294 \uc88b\uc544\uc694']
And with the regex, you'll find 1 word, then 2 words:
>>> response.xpath('.//p//text()').re(ur'[\uac00-\ud7af]+')
[u'\uc800', u'\ub294', u'\uc88b\uc544\uc694']
>>> for e in response.xpath('.//p//text()').re(ur'[\uac00-\ud7af]+'):
... print e
...
저
는
좋아요
If you use XPath string() function on the paragraph element, you get a single string, even if the element has other children like a:
>>> response.xpath('string(.//p)').extract()
[u'\uc800\ub294 \uc88b\uc544\uc694']
>>> print response.xpath('string(.//p)').extract_first()
저는 좋아요
And you can then apply your regular expression to split on words:
>>> response.xpath('string(.//p)').re(ur'[\uac00-\ud7af]+')
[u'\uc800\ub294', u'\uc88b\uc544\uc694']
>>> for e in response.xpath('string(.//p)').re(ur'[\uac00-\ud7af]+'):
... print e
...
저는
좋아요
Note that string(node-set) only considers the first element in the node-set you pass as argument, so make sure your XPath expression first matches the element you want, or you can also chain XPath expression with scrapy selectors:
>>> for e in response.xpath('.//p').xpath('string(.)').re(ur'[\uac00-\ud7af]+'):
... print e
...
저는
좋아요

RegEx for a price in £

i have: \£\d+\.\d\d
should find: £6.95 £16.95 etc
+ is one or more
\. is the dot
\d is for a digit
am i wrong? :(
JavaScript for Greasemonkey
// ==UserScript==
// @name CurConvertor
// @namespace CurConvertor
// @description noam smadja
// @include http://www.zavvi.com/*
// ==/UserScript==
textNodes = document.evaluate(
    "//text()",
    document,
    null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
    null);
var searchRE = /\£[0-9]\+.[0-9][0-9]/;
var replace = 'pling';
for (var i = 0; i < textNodes.snapshotLength; i++) {
    var node = textNodes.snapshotItem(i);
    node.data = node.data.replace(searchRE, replace);
}
When I change the regex to /Free/, for example, it finds and changes the text. But I guess I am missing something!
Had this written up for your last question just before it was deleted. Here are the problems you're having with your GM script.
1. You're checking absolutely every text node on the page for some reason. This isn't causing it to break, but it's unnecessary and slow. It would be better to look for text nodes inside .price nodes and .rrp .strike nodes instead.
2. When creating new RegExp objects this way, backslashes must be escaped, e.g.
var searchRE = new RegExp('\\d\\d', 'gi');
not
var searchRE = new RegExp('\d\d', 'gi');
So you can add the backslashes, or create your regex like this:
var searchRE = /\d\d/gi;
3. Your actual regular expression only checks for prices shaped like £, one digit, a literal + (the backslash escapes it), any character, then two digits, and will ignore £5.00 and £128.24.
4. Your replacement needs to be either a string or a callback function, not a regular expression object.
Putting it all together
textNodes = document.evaluate(
    "//p[contains(@class,'price')]/text() | //p[contains(@class,'rrp')]/span[contains(@class,'strike')]/text()",
    document,
    null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
    null);
var searchRE = /£(\d+\.\d\d)/gi;
var replace = function(str, p1) { return "₪" + ((p1 * 5.67).toFixed(2)); };
for (var i = 0, l = textNodes.snapshotLength; i < l; i++) {
    var node = textNodes.snapshotItem(i);
    node.data = node.data.replace(searchRE, replace);
}
Changes:
The XPath now includes only p.price and p.rrp span.strike nodes.
The search regular expression is created with /regex/ instead of new RegExp.
The search regex now includes the target currency symbol.
The replace variable is now a function that swaps the currency symbol for a new one and multiplies the first captured group by 5.67.
The for loop stores the snapshot length in a variable once, instead of checking textNodes.snapshotLength on every iteration.
Hope that helps!
[edit] Some of these points no longer apply, as the original question changed a few times, but the final script is relevant, and the points may still be of interest for why your script was failing originally.
You are not wrong, but there are a few things to watch out for:
The £ sign is not a standard ASCII character, so you may have an encoding issue, or you may need to enable a Unicode option on your regular expression.
The use of \d is not supported in all regular expression engines. [0-9] or [[:digit:]] are other possibilities.
To get a better answer, say which language you are using, and preferably also post your source code.
£[0-9]+(,[0-9]{3})*\.[0-9]{2}$
This will match anything from £dd.dd to £d[dd]*,ddd.dd, so it can fetch millions as well as hundreds.
This regexp is not strict about digit grouping, though: it would also accept, for example, £1123213123.23.
Now, if you want a stricter regexp, and you're 100% sure that the prices will follow the comma and period conventions consistently, then use:
£[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}$
Try your regexps here to see what works for you and what doesn't: http://tools.netshiftmedia.com/regexlibrary/
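If you'd rather sanity-check the stricter pattern locally than in an online tester, a quick Python 3 sketch (the sample prices are made up):
import re

strict = re.compile(r'£[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}$')
for price in ['£6.95', '£16.95', '£1,234,567.89', '£1123213123.23']:
    print(price, bool(strict.search(price)))
# the first three match; the last fails the comma-grouping rule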
It depends on what flavour of regex you are using - what is the programming language?
Some older regex tools require the + to be escaped - sed and vi, for example.
Also, some older regex implementations do not recognise \d as matching a digit.
Most modern regex engines follow the Perl syntax, and £\d+\.\d\d should do the trick, but it does also depend on how the £ is encoded - if the string you are matching encodes it differently from the regex, then it will not match.
Here is an example in Python - the £ character is represented differently in a regular string and a unicode string (prefixed with a u):
>>> "£"
'\xc2\xa3'
>>> u"£"
u'\xa3'
>>> import re
>>> print re.match("£", u"£")
None
>>> print re.match(u"£", "£")
None
>>> print re.match(u"£", u"£")
<_sre.SRE_Match object at 0x7ef34de8>
>>> print re.match("£", "£")
<_sre.SRE_Match object at 0x7ef34e90>
>>>
£ isn't an ascii character, so you need to work out encodings. Depending on the language, you will either need to escape the byte(s) of £ in the regex, or convert all the strings into Unicode before applying the regex.
In Ruby you could just write the following:
/£\d+\.\d{2}/
(note the escaped dot). Using braces to specify the number of digits after the point makes it slightly clearer.