scrapy/xpaths/regex: proper xpath/re to ignore "link interjections" - regex

I am scraping some Korean-language text and I come across a lot of "link interjections", for lack of a better word, where the HTML looks like this...
<a href="...">저</a>는 좋아요
It shows '저' as a hyperlink but '는 좋아요' as regular text. In reality they are part of the same word object and display on the page as '저는 좋아요', but when scraping using this xpath and regex...
foo = response.xpath('//*[@id="divID"]/p//text()').re(ur'[\uac00-\ud7af]+')
the first word is broken in two in the resulting list...
foo == ['저', '는', '좋아요']
How can I get this to keep it as one word, as was my original intent?
intended: foo == ['저는', '좋아요']
EDIT: (comment response)
the problem with .join() is that, as far as I can tell, it will join all the regularly scraped words as well. So I would end up with this...
''.join(foo) == '저는좋아요'
So I do not think that .join() will work unless there is something I am missing.

If you want to work on the string representation of an HTML element, XPath has a string() function that can be very helpful.
Once you have a single string for the element, you can apply regular expressions for words.
Here's a sample python interpreter session (I had to change your markup a bit to match the results you showed):
>>> import scrapy
>>>
>>> response = scrapy.Selector(text=u'<p><a href="#">저</a>는 좋아요</p>')
.//p//text() will select all descendant text nodes of the paragraph, as individual strings when .extract()ed (2 strings in this case):
>>> response.xpath('.//p//text()').extract()
[u'\uc800', u'\ub294 \uc88b\uc544\uc694']
And applying the regex to those 2 strings, you'll get 1 word from the first, then 2 words from the second:
>>> response.xpath('.//p//text()').re(ur'[\uac00-\ud7af]+')
[u'\uc800', u'\ub294', u'\uc88b\uc544\uc694']
>>> for e in response.xpath('.//p//text()').re(ur'[\uac00-\ud7af]+'):
...     print e
...
저
는
좋아요
If you use the XPath string() function on the paragraph element, you get a single string, even if the element has other children, like an a:
>>> response.xpath('string(.//p)').extract()
[u'\uc800\ub294 \uc88b\uc544\uc694']
>>> print response.xpath('string(.//p)').extract_first()
저는 좋아요
And you can then apply your regular expression to split on words:
>>> response.xpath('string(.//p)').re(ur'[\uac00-\ud7af]+')
[u'\uc800\ub294', u'\uc88b\uc544\uc694']
>>> for e in response.xpath('string(.//p)').re(ur'[\uac00-\ud7af]+'):
...     print e
...
저는
좋아요
Note that string(node-set) only considers the first node in the node-set you pass as argument, so make sure your XPath expression first matches the element you want; alternatively, you can chain XPath expressions with scrapy selectors:
>>> for e in response.xpath('.//p').xpath('string(.)').re(ur'[\uac00-\ud7af]+'):
...     print e
...
저는
좋아요
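For readers without a scrapy shell handy, the flattening that string(.) performs can be mimicked with the standard library's ElementTree; this is a minimal sketch assuming markup like the question's, with '저' wrapped in a link:

```python
import re
import xml.etree.ElementTree as ET

# markup with the hyperlinked '저', as described in the question
p = ET.fromstring('<p><a href="#">저</a>는 좋아요</p>')

# ''.join over itertext() concatenates every descendant text node,
# which is what XPath's string(.) does for a single element
full_text = ''.join(p.itertext())

print(re.findall('[\uac00-\ud7af]+', full_text))  # ['저는', '좋아요']
```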

Related

Preprocess words that do not match list of words

I have a very specific case I'm trying to match: I have some text and a list of words (which may contain numbers, underscores, or ampersands), and I want to clean the text of numeric characters (for instance) unless they appear in a word from my list. This list is also long enough that I can't just make a regex that matches every one of the words.
I've tried to use regex to do this (i.e. something along the lines of re.sub(r'\d+', '', text)) while trying to come up with a more complex regex to match my case. This obviously isn't quite working, as I don't think regex is meant to handle that kind of case.
I'm trying to experiment with other options like pyparsing, and tried something like the below, but this also gives me an error (probably because I'm not understanding pyparsing correctly):
from pyparsing import *
import re
phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was once a potato_man with tw3nty cars and d& 76 different homes"
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(lambda word: re.sub(r'\d+', '', word)))
parser.parseString(text)
What's the best way to approach this sort of matching, or are there other better suited libraries that would be worth a try?
You are very close to getting this pyparsing cleaner-upper working.
Parse actions generally get their matched tokens as a list-like structure, a pyparsing-defined class called ParseResults.
You can see what actually gets sent to your parse action by wrapping it in the pyparsing decorator traceParseAction:
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(traceParseAction(lambda word: re.sub(r'\d+', '', word))))
Actually it's a little easier to read if you make your parse action a regular def'ed method instead of a lambda:
@traceParseAction
def unnumber(word):
    return re.sub(r'\d+', '', word)

parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(unnumber))
traceParseAction will report what is passed to the parse action and what is returned.
>>entering unnumber(line: 'there was once a potato_man with tw3nty cars and d& 76 different homes', 0, ParseResults(['there'], {}))
<<leaving unnumber (exception: expected string or bytes-like object)
You can see that the value passed in is in a list structure, so you should replace word in your call to re.sub with word[0] (I also modified your input string to add some numbers to the unguarded words, to see the parse action in action):
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"
def unnumber(word):
    return re.sub(r'\d+', '', word[0])
and I get:
['there', 'was', 'once', 'a', 'potato_man', 'with', 'tw3nty', 'cars', 'and', 'd&', '76', 'different', 'homes']
Also, you use the '^' operator for your parser. You may get a little better performance if you use the '|' operator instead. '^' (which creates an Or instance) evaluates all alternatives and chooses the longest match - necessary in cases where there is some ambiguity in what the alternatives might match. '|' creates a MatchFirst instance, which stops once it finds a match and does not look further for any alternatives. Since your first alternative is the list of guard words, '|' is actually more appropriate - if one gets matched, don't look any further.
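For this particular input, where the tokens happen to be whitespace-separated, the same cleanup can also be sketched without pyparsing, using a set lookup plus a per-token re.sub (a minimal alternative, not a replacement for pyparsing's more general tokenizing):

```python
import re

phrases = {"76", "tw3nty", "potato_man", "d&"}
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"

def clean(token):
    # guard words pass through untouched; everything else loses its digits
    return token if token in phrases else re.sub(r'\d+', '', token)

print(' '.join(clean(t) for t in text.split()))
# there was once a potato_man with tw3nty cars and d& 76 different homes
```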

How to find a word in a document that may be written in different forms? (Python)

I need to find the word "Judgment" or "Judgement" or "JUDGMENT" or "JUDGEMENT" or "J U D G M E N T" from a document or any permutation/combination of those characters in both upper/lower cases (in that particular order). Is there a regex function that could help me out?
The problem is, I am applying the code to different documents and every document contains a different form of that word. My code needs to recognize the word in all instances.
I just used your question as the test string, because it has all the combinations you want. Try this with your other combinations, and leave a comment if this regex doesn't work.
>>> import re
>>>
>>> pattern = re.compile(r'(j[\s]*u[\s]*d[\s]*g[eM\s]*n[\s]*t)', re.IGNORECASE)
>>> string = """I need to find the word "Judgment" or "Judgement" or "JUDGMENT" or "JUDGEMENT" or "J U D G M E N T" from a document or any permutation/combination of those characters in both upper/lower cases (in that particular order). Is there a regex function that could help me out? The problem is, I am applying the code to different documents and every document contains a different form of that word. My code needs to recognize the word in all instances."""
>>>
>>> pattern.findall(string)
['Judgment', 'Judgement', 'JUDGMENT', 'JUDGEMENT', 'J U D G M E N T']
You would probably want to preprocess your text data; otherwise it wouldn't be practical, considering the time complexity of such a regular expression, if it is even possible. Permutation might be feasible, since the order of letters would remain the same; combination would be quite complicated, since it would also include words such as get, gem, Meg, and many others.
If you want a very permissive expression, maybe this one would be OK to look into:
\b([judgment\s]+)\b
but be aware of how it would fail: any word built only from the letters j, u, d, g, m, e, n, t (plus whitespace) will also match. You can explore, simplify, or modify the expression and test it against sample inputs on regex101.com, which explains each part in its top-right panel.
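To see concretely how the permissive pattern over-matches, here is a quick re session (gem and nut are arbitrary sample words, chosen for illustration):

```python
import re

# the permissive character-class pattern from above
loose = re.compile(r'\b([judgment\s]+)\b', re.IGNORECASE)

# it does match the intended word...
print(bool(loose.search("JUDGEMENT")))       # True
# ...but also any word built only from those letters
print(bool(loose.search("a gem and a nut")))  # True
# while unrelated words are still rejected
print(bool(loose.search("fox")))              # False
```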

Is there a regular expression for finding all question sentences from a webpage?

I am trying to extract some questions from a web site using BeautifulSoup, and want to use a regular expression to get these questions from the web. Is my regular expression incorrect? And how can I combine soup.find_all with re.compile?
I have tried the following:
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
import urllib
import re
url = "https://www.sanfoundry.com/python-questions-answers-variable-names/"
headers = {'User-Agent':'Mozilla/5.0'}
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")
a = soup.find_all("p")
for m in a:
    print(m.get_text())
Now I have some text containing the questions like "1. Is Python case sensitive when dealing with identifiers?". I want to use r"[^.!?]+\?" to filter out the unwanted text, but I have the following error:
a = soup.find_all("p" : re.compile(r'[^.!?]+\?'))
                      ^
SyntaxError: invalid syntax
I checked my regular expression on https://regex101.com, it seems right. Is there a way to combine the regular expression and soup.find_all together?
One way to find p elements containing a ? is to
define a criterion function:
def criterion(tag):
    return tag.name == 'p' and re.search(r'\?', tag.text)
and use it in find_all:
pars = soup.find_all(criterion)
But you want to print only the questions, not the whole paragraphs from pars.
To match these questions, define a pattern:
pat = re.compile(r'\d+\.\s[^?]+\?')
(a sequence of digits, a dot, a space, then a sequence of chars other
than ? and finally a ?).
Note that in the general case one paragraph may contain multiple questions. So the loop processing the paragraphs found should:
- use findall to find all questions in the current paragraph (the result is a list of the strings found),
- print all of them, on separate lines, using join with '\n' as the separator.
So the whole loop should be:
for m in pars:
    questions = pat.findall(m.get_text())
    print('\n'.join(questions))
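Applied to a made-up two-question paragraph (the text below is an illustration, not scraped from the site), the pattern picks out both questions:

```python
import re

pat = re.compile(r'\d+\.\s[^?]+\?')

# a hypothetical paragraph holding two questions plus answer options
para = ("1. Is Python case sensitive when dealing with identifiers? "
        "a) yes b) no 2. What is the maximum possible length of an identifier?")

for q in pat.findall(para):
    print(q)
```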
Not a big regex fan, so tried this:
for q in a:
    for i in q:
        if '?' in i:
            print(i)
Output:
1. Is Python case sensitive when dealing with identifiers?
2. What is the maximum possible length of an identifier?
3. Which of the following is invalid?
4. Which of the following is an invalid variable?
5. Why are local variable names beginning with an underscore discouraged?
6. Which of the following is not a keyword?
8. Which of the following is true for variable names in Python?
9. Which of the following is an invalid statement?
10. Which of the following cannot be a variable?

Using the contains function in xpath 1.0 to select numbers

I'm using scrapy and I need to scrape something like this: any number, followed by a dash, followed by any number, then a whitespace, then two letters (e.g. 1-3 mm). It seems xpath 1.0 does not allow the use of regex. Searching around, I've found some workarounds like using starts-with() and ends-with() but from what I've seen they only use it with letters. Please help.
Scrapy uses lxml internally, and lxml's XPath has support for regular expressions via EXSLT when you add the corresponding namespaces.
Scrapy does that by default so you can use re:test() within XPath expressions as a boolean for predicates.
boolean re:test(string, string, string?)
The re:test function returns true if the string given as the first argument matches the regular expression given as the second argument.
See this example Python2 session:
>>> import scrapy
>>> t = u"""<!DOCTYPE html>
... <html lang="en">
... <body>
... <p>ab-34mm</p>
... <p>102-d mm</p>
... <p>15-22 µm</p>
... <p>1-3 nm</p>
... </body>
... </html>"""
>>> selector = scrapy.Selector(text=t)
>>> selector.xpath(r'//p/text()[re:test(., "\d+-\d+\s\w{2}")]').extract()
[u'15-22 \xb5m', u'1-3 nm']
>>>
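The regular expression itself can be sanity-checked with plain Python re against the same four strings (this exercises only re:test's pattern, not the XPath machinery):

```python
import re

pat = re.compile(r'\d+-\d+\s\w{2}')
samples = ["ab-34mm", "102-d mm", "15-22 µm", "1-3 nm"]

# keep only the strings the pattern would accept
print([s for s in samples if pat.search(s)])  # ['15-22 µm', '1-3 nm']
```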
Edit: note on using EXSLT re:match
Using EXSLT re:match is a bit trickier, or at least less natural, than re:test. re:match is similar to Python's re.match, which returns a MatchObject.
The signature is different from re:test:
object regexp:match(string, string, string?)
The regexp:match function returns a node set of match elements
So re:match will return <match> elements. To capture the string from these <match> elements, you need to use the function as the "outer" function, not inside predicates.
The following example chains XPath expressions,
selecting <p> paragraphs
then matching each paragraph string-value (normalized) with a regular expression containing parenthesized groups
finally extracting the result of these re:match calls
Python2 shell:
>>> for p in selector.xpath('//p'):
... print(p.xpath(ur're:match(normalize-space(.), "(\d+)-(\d+)\s(\w{2})")').extract())
...
[]
[]
[u'<match>15-22 \xb5m</match>', u'<match>15</match>', u'<match>22</match>', u'<match>\xb5m</match>']
[u'<match>1-3 nm</match>', u'<match>1</match>', u'<match>3</match>', u'<match>nm</match>']
>>>
To do this with XPath 1.0 you can use the translate function.
translate(@test, '1234567890', '..........') will replace any number (digit) with a dot.
If your numbers are always one digit you may try something like:
[translate(@test, '1234567890', '..........') = '.-. mm']
If the numbers could be longer than one digit, you may try to replace digits with nothing and test for '- mm':
[translate(@test, '1234567890', '') = '- mm']
But this can produce some false positives. To avoid them you will need to check, with substring-before/substring-after and string-length, that there was at least one digit.
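XPath 1.0's translate() semantics (characters in the second argument with no counterpart in the third are deleted) can be sketched in Python to see what the predicates above compare against; @test here stands for whatever attribute or text you are actually checking:

```python
def xpath_translate(s, chars_from, chars_to):
    # characters in chars_from beyond the length of chars_to map to None,
    # i.e. they are deleted, exactly as XPath 1.0's translate() specifies
    table = {ord(c): (chars_to[i] if i < len(chars_to) else None)
             for i, c in enumerate(chars_from)}
    return s.translate(table)

print(xpath_translate("1-3 mm", "1234567890", "." * 10))  # '.-. mm'
print(xpath_translate("15-22 mm", "1234567890", ""))      # '- mm'
```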

Regular expression to group a pattern OR group empty string as ""

I'm using Python 3.3.2 with regular expressions. I have a pretty simple function
def DoRegexThings(somestring):
m = re.match(r'(^\d+)( .*$)?', somestring)
return m.group(1)
Which I am using to just get a numeric portion at the beginning of string, and discard the rest. However, it fails on the case of an empty string, since it is unable to match a group.
I've looked at this similar question which was asked previously, and changed my regular expression to this:
(^$)|(^\d+)( .*$)?
But it only causes it to return "None" every time, and still fails on empty strings. What I really want is a regular expression which I can use to either grab the numeric portion of my record, e.g. if the record is 1234 sometext, I just want 1234, or if the string is empty I want m.group(1) to return an empty string. My workaround right now is
m = re.match(r'(^\d+)( .*$)?', somestring)
if m == None: # Handle empty string case
    return somestring
else:
    return m.group(1)
But if I can avoid checking the match object for None, I'd like to. Is there a way to accomplish this?
I think you're making this overly complicated:
re.match(r"\d*", somestring).group()
will return a number if it's at the start of the string (.match() ensures this) or the empty string if there is no number.
>>> import re
>>> somestring = "987kjh"
>>> re.match(r"\d*", somestring).group()
'987'
>>> somestring = "kjh"
>>> re.match(r"\d*", somestring).group()
''