Scrapy convert from unicode to utf-8 - python-2.7

I've written a simple script to extract data from a site. The script works as expected, but I'm not pleased with the output format.
Here is my code:
class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["example.com"]
    start_urls = (
        "http://example.com/tag/1/page/1",
    )

    def parse(self, response):
        next_selector = response.xpath('//a[@class="next"]/@href')
        url = next_selector[1].extract()
        # url is like "tag/1/page/2"
        yield Request(urlparse.urljoin("http://example.com", url))

        item_selector = response.xpath('//h3/a/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin("http://example.com", url),
                          callback=self.parse_article)

    def parse_article(self, response):
        item = ItemLoader(item=Article(), response=response)
        # here I extract the title of every article
        item.add_xpath('title', '//h1[@class="title"]/text()')
        return item.load_item()
I'm not pleased with the output, something like:
[scrapy] DEBUG: Scraped from <200 http://example.com/tag/1/article_name>
{'title': [u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"']}
I think I need to use a custom ItemLoader class, but I don't know how. I need your help.
TL;DR I need to convert text scraped by Scrapy from unicode to utf-8.

As you can see below, this isn't so much a Scrapy issue as a Python one, and it could only marginally be called an issue at all :)
$ scrapy shell http://censor.net.ua/resonance/267150/voobscheto_svoboda_zakanchivaetsya
In [7]: print response.xpath('//h1/text()').extract_first()
 "ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ"
In [8]: response.xpath('//h1/text()').extract_first()
Out[8]: u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"'
What you see is two different representations of the same thing - a unicode string.
What I would suggest is to run crawls with -L INFO or add LOG_LEVEL='INFO' to your settings.py so that this output isn't shown in the console.
One annoying thing is that when you save as JSON, you get escaped unicode JSON e.g.
$ scrapy crawl example -L INFO -o a.jl
gives you:
$ cat a.jl
{"title": "\u00a0\"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f\""}
This is correct, but it takes more space and most applications handle non-escaped JSON equally well.
Adding a few lines to your settings.py can change this behaviour:
from scrapy.exporters import JsonLinesItemExporter

class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
    'jl': 'myproject.settings.MyJsonLinesItemExporter',
}
Essentially all we do is set ensure_ascii=False for the default JSON item exporters, which prevents escaping. I wish there were an easier way to pass arguments to exporters, but I can't see one, since they are initialized with their default arguments in the feed export code. Anyway, your JSON file now has:
$ cat a.jl
{"title": " \"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ\""}
which is better-looking, equally valid and more compact.
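If you are on a newer Scrapy release (1.2 or later, which is an assumption about your setup), it may be simpler to set the built-in FEED_EXPORT_ENCODING option instead of overriding the exporter:

# settings.py -- requires Scrapy >= 1.2
FEED_EXPORT_ENCODING = 'utf-8'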

There are two independent issues affecting the display of unicode strings.
1. If you return a list of strings, the output file will have some issues with it, because it will use the ascii codec by default to serialize the list elements. You can work around this as below, but it's more appropriate to use extract_first() as suggested by @neverlastn (an ItemLoader-based variant is also sketched after this answer):
class Article(Item):
    title = Field(serializer=lambda x: u', '.join(x))
2. The default implementation of the __repr__() method will serialize unicode strings to their escaped \uxxxx form. You can change this behaviour by overriding the method in your item class:
class Article(Item):
    def __repr__(self):
        data = self.copy()
        for k in data.keys():
            if type(data[k]) is unicode:
                data[k] = data[k].encode('utf-8')
        # build the text by hand so repr() doesn't re-escape the utf-8 bytes
        return '{%s}' % ', '.join('%r: %s' % (k, v) for k, v in data.items())
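Since the question already uses an ItemLoader, another option (a sketch not taken from the answers above, assuming Scrapy 1.x) is to declare TakeFirst as the default output processor, so that each field holds a single string rather than a one-element list:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class ArticleLoader(ItemLoader):
    # return the first extracted value instead of a one-element list
    default_output_processor = TakeFirst()

The spider's parse_article() would then instantiate ArticleLoader instead of ItemLoader.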

Related

Web crawling and extracting data using Scrapy

I am new to Python as well as Scrapy.
I am trying to crawl the seed URL https://www.health.com/patients/status/. This seed URL contains many URLs, but I want to fetch only the URLs that contain Faci/Details/#somenumber. The URLs look like this:
https://www.health.com/patients/status/ ->https://www.health.com/Faci/Details/2
-> https://www.health.com/Faci/Details/3
-> https://www.health.com/Faci/Details/4
https://www.health.com/Faci/Details/2 -> https://www.health.com/provi/details/64
-> https://www.health.com/provi/details/65
https://www.health.com/Faci/Details/3 -> https://www.health.com/provi/details/70
-> https://www.health.com/provi/details/71
Inside each https://www.health.com/Faci/Details/2 page there are links like https://www.health.com/provi/details/64,
https://www.health.com/provi/details/65, and so on. Finally, I want to fetch some data from the
https://www.health.com/provi/details/#somenumber URLs. How can I achieve this?
So far I have tried the code below from the Scrapy tutorial and am able to crawl only URLs that contain https://www.health.com/Faci/Details/#somenumber. It's not going to https://www.health.com/provi/details/#somenumber. I tried to set a depth limit in the settings.py file, but it didn't work.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news.items import NewsItem

class MySpider(CrawlSpider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    rules = (
        Rule(LinkExtractor(allow=('/Faci/Details/\d+', )), follow=True),
        Rule(LinkExtractor(allow=('/provi/details/\d+', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['id'] = response.xpath("//title/text()").extract()
        item['name'] = response.xpath("//title/text()").extract()
        item['description'] = response.css('p.introduction::text').extract()
        filename = 'details.txt'
        with open(filename, 'wb') as f:
            f.write(item)
        self.log('Saved file %s' % filename)
        return item
Please help me proceed further.
To be honest, the regex-based and mighty Rule/LinkExtractor has often given me a hard time. For a simple project, one approach is to extract all links on the page and then look at the href attribute. If the href matches your needs, yield a new Request for it. For instance:
from scrapy.http import Request
from scrapy.selector import Selector

...

# follow links
for href in sel.xpath('//div[@class="contentLeft"]//div[@class="pageNavigation nobr"]//a').extract():
    linktext = Selector(text=href).xpath('//a/text()').extract_first()
    if linktext and linktext == "Weiter":
        link = Selector(text=href).xpath('//a/@href').extract()[0]
        url = response.urljoin(link)
        print url
        yield Request(url, callback=self.parse)
Some remarks on your code:
response.xpath(...).extract()
This returns a list; you may want to have a look at extract_first(), which provides the first item (or None).
with open(filename, 'wb') as f:
This will overwrite the file several times, so you will only keep the last item saved. You also open the file in binary mode ('b'); from the filename I guess you want to read it as text. Use 'a' to append. See the open() docs.
An alternative is to use the -o flag and Scrapy's built-in facilities to store the items as JSON or CSV.
return item
It is good style to yield items instead of returning them. At the least, if you need to create several items from one page, you have to yield them.
Another good approach is to use one parse function per type/kind of page.
For instance, every page in start_urls ends up in parse(). From there you can extract the links and yield a Request for each /Faci/Details/N page with a callback of parse_faci_details(). In parse_faci_details() you again extract the links of interest, create Requests and pass them via callback= to, e.g., parse_provi_details().
In that function you create the items you need.
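A minimal sketch of that layout, reusing the item class and URL patterns from the question (the method names parse_faci_details and parse_provi_details come from the paragraph above; the link selection and field choices are assumptions):

import scrapy
from news.items import NewsItem

class ProvDetailsSpider(scrapy.Spider):
    name = 'provdetails'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    def parse(self, response):
        # the seed page: follow only the /Faci/Details/N links
        for href in response.xpath('//a/@href').extract():
            if '/Faci/Details/' in href:
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_faci_details)

    def parse_faci_details(self, response):
        # a facility page: follow only the /provi/details/N links
        for href in response.xpath('//a/@href').extract():
            if '/provi/details/' in href:
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_provi_details)

    def parse_provi_details(self, response):
        # a provider page: build the item here
        item = NewsItem()
        item['name'] = response.xpath('//title/text()').extract_first()
        item['description'] = response.css('p.introduction::text').extract_first()
        yield item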

Why does scrapy miss some links?

I am scraping the web-site "www.accell-group.com" using the "scrapy" library for Python. The site is scraped completely, in total 131 pages (text/html) and 2 documents (application/pdf) are identified. Scrapy did not throw any warnings or errors. My algorithm is supposed to scrape every single link. I use CrawlSpider.
However, when I look into the page "http://www.accell-group.com/nl/investor-relations/jaarverslagen/jaarverslagen-van-accell-group.htm", which is reported by "scrapy" as scraped/processed, I see that there are more pdf-documents, for example "http://www.accell-group.com/files/4/5/0/1/Jaarverslag2014.pdf". I cannot find any reasons for it not to be scraped. There is no dynamic/JavaScript content on this page. It is not forbidden in "http://www.airproducts.com/robots.txt".
Do you maybe have any idea why this can happen?
Could it be because the "files" folder is not in "http://www.accell-group.com/sitemap.xml"?
Thanks in advance!
My code:
class PyscrappSpider(CrawlSpider):
    """This is the Pyscrapp spider"""
    name = "PyscrappSpider"

    def __init__(self, *a, **kw):
        # Get the passed URL
        originalURL = kw.get('originalURL')
        logger.debug('Original url = {}'.format(originalURL))

        # Add a protocol, if needed
        startURL = 'http://{}/'.format(originalURL)
        self.start_urls = [startURL]

        self.in_redirect = {}
        self.allowed_domains = [urlparse(i).hostname.strip() for i in self.start_urls]
        self.pattern = r""
        self.rules = (Rule(LinkExtractor(deny=[r"accessdenied"]), callback="parse_data", follow=True), )

        # Get WARC writer
        self.warcHandler = kw.get('warcHandler')

        # Initialise the base constructor
        super(PyscrappSpider, self).__init__(*a, **kw)

    def parse_start_url(self, response):
        if (response.request.meta.has_key("redirect_urls")):
            original_url = response.request.meta["redirect_urls"][0]
            if ((not self.in_redirect.has_key(original_url)) or (not self.in_redirect[original_url])):
                self.in_redirect[original_url] = True
                self.allowed_domains.append(original_url)
        return self.parse_data(response)

    def parse_data(self, response):
        """This function extracts data from the page."""
        self.warcHandler.write_response(response)

        pattern = self.pattern
        # Check if we are interested in the current page
        if (not response.request.headers.get('Referer')
                or re.search(pattern, self.ensure_not_null(response.meta.get('link_text')), re.IGNORECASE)
                or re.search(r"/(" + pattern + r")", self.ensure_not_null(response.url), re.IGNORECASE)):
            logging.debug("This page gets processed = %(url)s", {'url': response.url})
            sel = Selector(response)

            item = PyscrappItem()
            item['url'] = response.url

            return item
        else:
            logging.warning("This page does NOT get processed = %(url)s", {'url': response.url})
            return response.request
Remove or expand your "allowed_domains" variable appropriately and you should be fine. All the URLs the spider follows are, by default, restricted by allowed_domains.
EDIT: This case particularly concerns PDFs. PDFs are excluded by extension, as per the default value of deny_extensions (see here), which is IGNORED_EXTENSIONS (see here).
To allow your application to crawl PDFs, all you have to do is exclude them from IGNORED_EXTENSIONS by explicitly setting the value of deny_extensions:
from scrapy.linkextractors import IGNORED_EXTENSIONS

self.rules = (Rule(...
    LinkExtractor(deny=[r"accessdenied"], deny_extensions=set(IGNORED_EXTENSIONS) - set(['pdf']))
    ..., callback="parse_data"...
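For completeness, a fully assembled version of that rule might look like the following sketch (it keeps the accessdenied deny pattern and the parse_data callback from the spider above):

from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS
from scrapy.spiders import Rule

# inside PyscrappSpider.__init__():
self.rules = (
    Rule(
        LinkExtractor(
            deny=[r"accessdenied"],
            # drop 'pdf' from the ignored extensions so .pdf links are followed
            deny_extensions=set(IGNORED_EXTENSIONS) - set(['pdf']),
        ),
        callback="parse_data",
        follow=True,
    ),
)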
So, I'm afraid, this is the answer to the question "Why does Scrapy miss some links?". As you will likely see, it just opens the door to further questions, like "how do I handle those PDFs?", but I guess that is the subject of another question.

Xapian search terms which exceed the 245 character length: InvalidArgumentError: Term too long (> 245)

I'm using Xapian and Haystack in my Django app. I have a model which contains a text field that I want to index for searching. This field is used to store all sorts of characters: words, URLs, HTML, etc.
I'm using the default document-based index template:
text = indexes.CharField(document=True, use_template=True)
This sometimes yields the following error when someone has pasted a particularly long link:
InvalidArgumentError: Term too long (> 245)
Now I understand the error. I've gotten around it before for other fields in other situations.
My question is, what's the preferred way to handle this exception?
It seems that handling this exception requires me to use a prepare_text() method:
def prepare_text(self, obj):
    content = []
    for word in obj.body.split(' '):
        if len(word) <= 245:
            content += [word]
    return ' '.join(content)
It just seems clunky and prone to problems. Plus I can't use the search templates.
How have you handled this problem?
I think you've got it right. There's a patch on the inkscape xapian_backend fork, inspired by the Xapian Omega project.
I've done something similar to what you've done on my project, with a small trick in order to still use the search index template:
# regex to efficiently truncate with re.sub
_max_length = 240
_regex = re.compile(r"([^\s]{{{}}})([^\s]+)".format(_max_length), re.UNICODE)

def prepare_text(self, object):
    # this is for using the template mechanics
    field = self.fields["text"]
    text = self.prepared_data[field.index_fieldname]

    encoding = "utf8"
    encoded = text.encode(encoding)

    prepared = _regex.sub(r"\1", encoded)
    if len(prepared) != len(encoded):
        return prepared.decode(encoding, 'ignore')
    return text
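As a quick illustration of what that regex does (a standalone sketch, independent of the backend code): it matches any whitespace-free token longer than 240 characters and keeps only its first 240 characters.

import re

_max_length = 240
_regex = re.compile(r"([^\s]{{{}}})([^\s]+)".format(_max_length))

text = "short words " + "x" * 300          # one over-long token at the end
truncated = _regex.sub(r"\1", text)

assert truncated.startswith("short words ")
assert len(truncated.split()[-1]) == _max_length   # the long token was cut to 240 chars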

grep/sed/awk - extract substring from html code

I want to get a value from HTML code like this:
<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:
As a result I need just the value: "53".
How can this be done using Linux command-line tools like grep, awk or sed? I want to use it on a Raspberry Pi.
Trying this doesn't work:
root@raspberrypi:/home/pi# echo "<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:" >> test.txt
root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' test.txt
root@raspberrypi:/home/pi#
Because HTML is not a flat-text format, handling it with flat-text tools such as grep, sed or awk is not advisable. If the format of the HTML changes slightly (for example: if the span node gets another attribute or newlines are inserted somewhere), anything you build this way will have a tendency to break.
It is more robust (if more laborious) to use something that is built to parse HTML. In this case, I'd consider using Python because it has a (rudimentary) HTML parser in its standard library. It could look roughly like this:
#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super(my_parser, self).__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting, assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but will make this easier to adapt for other things and
    # is not more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're in the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so as
# binary to get bytes from f.read() instead of a string (which requires
# the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()

    # then feed it the file. If the file is not UTF-8, it is necessary to
    # convert the file contents to UTF-8. I'm assuming latin1-encoded
    # data here; since the example looks German, "latin9" might also be
    # appropriate. Use the encoding in which your data is encoded.
    p.feed(f.read().decode("latin1"))

    # trim (in case of newlines/spaces around the data), remove % at the end,
    # then print
    print(re.compile('%$').sub('', p.data.strip()))
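As a quick sanity check, the parser class above can also be fed the snippet from the question directly (bypassing the file handling):

p = my_parser()
p.feed('<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:')
print(p.data)   # prints: 53%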
Addendum: Here's a backport to Python 2 that bulldozes right over encoding problems. For this case, that is arguably nicer because encoding doesn't matter for the data we want to extract and you don't have to know the encoding of the input file in advance. The changes are minor, and the way it works is exactly the same:
#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())
    print(re.compile('%$').sub('', p.data.strip()))

Django smart_str on queryset

I need to use smart_str on the results of a query in my view to take care of latin characters. How can I convert each item in my queryset?
I have tried:
...
mylist = []
myquery_set = Locality.objects.all()
for item in myquery_set:
    mylist.append(smart_str(item))
...
But I get the error:
coercing to Unicode: need string or buffer, <object> found
What is the best way to do this? Or can I take care of it in the template as I iterate the results?
EDIT: if I output the values to a template then all is good. However, I want to output the response as an .xls file using the code:
...
filename = "locality.xls"
response['Content-Disposition'] = 'attachment; filename='+filename
response['Content-Type'] = 'application/vnd.ms-excel; charset=utf-8'
return response
The view works fine (gives me the file etc.) but the latin characters are not rendered properly.
In your code you're calling smart_str on a Model object instead of a string (so basically you're trying to convert the object itself to a string). The solution is to call smart_str on a field:
mylist.append(smart_str(item.fieldname))
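Put together with the loop from the question, that might look like the following (the field name `name` is a placeholder for whichever Locality field holds the text):

from django.utils.encoding import smart_str

mylist = []
myquery_set = Locality.objects.all()
for item in myquery_set:
    # encode the text field, not the model instance itself
    mylist.append(smart_str(item.name))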