Selecting nodes with non-ASCII characters in Scrapy - python-2.7

I have the following simple web scraper written in Scrapy:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpiderTest(BaseSpider):
    name = 'MySpiderTest'
    allowed_domains = ["boliga.dk"]
    start_urls = ["http://www.boliga.dk/bbrinfo/3B71489C-AEA0-44CA-A0B2-7BD909B35618",]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = bbrItem()
        print hxs.select("id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badeværelser')]]/td[2]/text()").extract()
but when I run the spider I get the following syntax error:
SyntaxError: Non-ASCII character '\xe6' in file... on line 32, but no encoding declared
because of the æ in the XPath. The XPath works in XPath Checker for Firefox. I tried URL-encoding the æ, but that didn't work. What am I missing?
Thanks!
UPDATE: I have added the encoding declaration at the beginning of the code (Latin-1 should cover Danish characters)

Use a unicode string for your XPath expression:
hxs.select(u"id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badeværelser')]]/td[2]/text()").extract()
or
hxs.select(u"id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badev\u00e6relser')]]/td[2]/text()").extract()
See Unicode Literals in Python Source Code
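A quick sketch of the difference in Python 2.7 (the file still needs a coding declaration that matches its actual encoding for the æ literal):
# -*- coding: utf-8 -*-
expr = u"contains(., 'Antal Badeværelser')"   # type 'unicode'
raw = "contains(., 'Antal Badeværelser')"     # type 'str' (raw bytes)
print type(expr)  # <type 'unicode'>
print type(raw)   # <type 'str'>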

SyntaxError: Non-ASCII character '\xe2' in file … on line 40,
but no encoding declared …
This is caused by standard characters, like the straight apostrophe ('), being replaced by typographic characters, like curly quotes (‘ ’), when text is copied from a PDF.
Re-type the quotes in any code you copied from a PDF.
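If you are not sure where the offending character sits, a small sketch like this (the filename is a placeholder) prints every line of a source file that contains non-ASCII bytes:
# Sketch: report lines containing bytes outside the ASCII range (Python 2.7).
with open('myscript.py', 'rb') as f:
    for lineno, line in enumerate(f, 1):
        if any(ord(c) > 127 for c in line):
            print lineno, repr(line)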

response.xpath(u"//tr[contains(., '" + u'中文字符' + u"')]").extract()
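A variant of the same idea that keeps the whole expression unicode, which avoids Python 2's implicit str/unicode coercion during concatenation (the search text is just an example):
needle = u'中文字符'  # any non-ASCII search text
response.xpath(u"//tr[contains(., '%s')]" % needle).extract()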

Related

How can I clean an Urdu data corpus in Python without NLTK

I have a corpus of more than 10,000 words in Urdu, and I want to clean my data. Special Unicode characters like "!؟ـ،" appear in my text, and whenever I use regular expressions I get an error that my data is not properly encoded.
Kindly provide me some help to clean my data.
Thank you
Here is my sample data:
ظہیر
احمد
ماہرہ
خان
کی،
تصاویر،
نے
دائیں
اور
بائیں
والوں
کو
آسمانوں
پر
پہنچایا
،ہوا
ہے۔
دائیں؟
والے
I used your sample to find all words containing ہ or ر.
Notice that I had to tell Python that I am dealing with Unicode data by putting a u prefix in front of the regex string as well as the data string.
import re
data = u"""
ظہیر
احمد
ماہرہ
خان
.....
"""
result = re.findall(u'[^\s\n]+[ہر][^\s\n]+', data, re.MULTILINE)
print(result)
The output was
['ظہیر', 'ماہرہ', 'تصاویر،', 'پہنچایا', '،ہوا']
Another example removes all non-alphabetic characters except whitespace and makes sure a single space separates the words:
result = re.sub(u' +', u' ', re.sub(u'[\W\s]', u' ', data))
print(result)
The output is:
ظہیر احمد ماہرہ خان کی تصاویر نے دائیں اور بائیں والوں کو آسمانوں پر پہنچایا ہوا ہے دائیں والے
You can also use a word tokenizer:
import nltk
result = nltk.tokenize.wordpunct_tokenize(data)
print(result)
The output will be:
['ظہیر', 'احمد', 'ماہرہ', 'خان', 'کی', '،', 'تصاویر', '،', 'نے', 'دائیں', 'اور', 'بائیں', 'والوں', 'کو', 'آسمانوں', 'پر', 'پہنچایا', '،', 'ہوا', 'ہے', '۔', 'دائیں', '؟', 'والے']
Edit: for Python 2.7 you have to specify the encoding at the beginning of the source file, and tell re that the pattern is Unicode by passing re.UNICODE as the flags keyword argument:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import re
data = u"""ظہیر
احمد
ماہرہ
خان
کی،
.....
"""
# flags must be passed by keyword: the fourth positional argument of
# re.sub is count, not flags.
result = re.sub(ur'\s+', u' ',
                re.sub(ur'[\W\s]', u' ', data, flags=re.UNICODE),
                flags=re.UNICODE)
print(result)
Also note the use of ur'' to specify that the pattern is a unicode raw string.
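An equivalent sketch with precompiled patterns, which makes the flag explicit and sidesteps re.sub's positional count argument entirely:
# Compile once with re.UNICODE so \W and \s match Unicode categories.
strip_punct = re.compile(ur'[\W\s]', re.UNICODE)
squeeze_ws = re.compile(ur'\s+', re.UNICODE)
result = squeeze_ws.sub(u' ', strip_punct.sub(u' ', data))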

wolframalpha API syntax error

I am working with the 'wolframalpha' API and I keep getting this error. I tried to search but could not find any working post on this error. If you know what is wrong, please help me fix it.
File "jal.py", line 9
app_id=’PR5756-H3EP749GGH'
^
SyntaxError: invalid syntax
Please help; I have to show my project tomorrow :(
My code is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import wolframalpha
import sys

app_id=’PR5756-H3EP749GGH'
client = wolframalpha.Client(app_id)
query = ‘ ‘.join(sys.argv[1:])
res = client.query(query)

if len(res.pods) > 0:
    texts = “”
    pod = res.pods[1]
    if pod.text:
        texts = pod.text
    else:
        texts = “I have no answer for that”
    # to skip ascii character in case of error
    texts = texts.encode(‘ascii’, ‘ignore’)
    print texts
else:
    print “Sorry, I am not sure.”
You used typographic quotes (’ ‘ “ ”) instead of straight quotes (' and "). The same problem repeats further down, in the query, texts, and print lines.
app_id='PR5756-H3EP749GGH'
Python points you directly at the error.
Also, use an editor with syntax highlighting; stray typographic quotes then stand out immediately.
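For reference, a sketch of the same script with every typographic quote replaced by a straight one (the app ID is the placeholder from the question):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import wolframalpha
import sys

app_id = 'PR5756-H3EP749GGH'  # placeholder from the question
client = wolframalpha.Client(app_id)
query = ' '.join(sys.argv[1:])
res = client.query(query)

if len(res.pods) > 0:
    pod = res.pods[1]
    texts = pod.text if pod.text else "I have no answer for that"
    # drop non-ASCII characters to avoid console encoding errors
    texts = texts.encode('ascii', 'ignore')
    print texts
else:
    print "Sorry, I am not sure."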

Unicode and urllib.open

I am creating an application in Python that parses weather data from yr.no. It works fine with regular ASCII strings, but fails when I use unicode.
def GetYRNOWeatherData(country, province, place):
    # Parse the XML file
    wtree = ET.parse(urllib.urlopen("http://www.yr.no/place/"
        + string.replace(country, ' ', '_').encode('utf-8') + "/"
        + string.replace(province, ' ', '_').encode('utf-8') + "/"
        + string.replace(place, ' ', '_').encode('utf-8')
        + "/forecast.xml"))
For example, when I try
GetYRNOWeatherData("France", "Île-de-France", "Paris")
I get this error
'charmap' codec can't encode character u'\xce' in position 0: character maps to <undefined>
Is it true that urllib doesn't handle unicode very well? Since I am using Tkinter as a frontend to this function, could that be the source of the problem (does the Tkinter Entry widget handle unicode well)?
You can handle this by keeping every string as a unicode right up until you actually make the urllib.urlopen request, at which point you encode to utf-8:
#!/usr/bin/python
# -*- coding: utf-8 -*-

# This import makes all literal strings in the file default to
# type 'unicode' rather than type 'str'. You don't need to use this,
# but you'd need to do u"France" instead of just "France" below, and
# everywhere else you have a string literal.
from __future__ import unicode_literals

import urllib
import xml.etree.ElementTree as ET

def do_format(*args):
    ret = []
    for arg in args:
        ret.append(arg.replace(" ", "_"))
    return ret

def GetYRNOWeatherData(country, province, place):
    country, province, place = do_format(country, province, place)
    url = "http://www.yr.no/place/{}/{}/{}/forecast.xml".format(country, province, place)
    wtree = ET.parse(urllib.urlopen(url.encode('utf-8')))
    return wtree

if __name__ == "__main__":
    GetYRNOWeatherData("France", "Île-de-France", "Paris")
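If the server objects to raw UTF-8 bytes in the URL path, percent-encoding each segment is the more robust variant (a sketch along the same lines):
import urllib

def quote_segment(s):
    # encode to UTF-8 bytes, then percent-encode for use in a URL path
    return urllib.quote(s.encode('utf-8'))

url = "http://www.yr.no/place/%s/%s/%s/forecast.xml" % (
    quote_segment(country), quote_segment(province), quote_segment(place))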

Scrapy:: issue with encoding when dumping to the json file

Here is the website I would like to parse: [website in Russian][1]
Here is the code that extracts the info that I need:
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from flats.items import FlatsItem

class DmozSpider(Spider):
    name = "dmoz"
    start_urls = ['http://rieltor.ua/flats-sale/?ncrnd=6510']

    def parse(self, response):
        sel = Selector(response)
        flats = sel.xpath('//*[@id="content"]')
        flats_stored_info = []
        flat_item = FlatsItem()
        for flat in flats:
            flat_item['square'] = [s.encode("utf-8") for s in sel.xpath('//div/strong[@class="param"][1]/text()').extract()]
            flat_item['rooms_floor_floors'] = [s.encode("utf-8") for s in sel.xpath('//div/strong[@class="param"][2]/text()').extract()]
            flat_item['address'] = [s.encode("utf-8") for s in flat.xpath('//*[@id="content"]//h2/a/text()').extract()]
            flat_item['price'] = [s.encode("utf-8") for s in flat.xpath('//div[@class="cost"]/strong/text()').extract()]
            flat_item['subway'] = [s.encode("utf-8") for s in flat.xpath('//span[@class="flag flag-location"]/a/text()').extract()]
            flats_stored_info.append(flat_item)
        return flats_stored_info
This is how I dump to a JSON file:
scrapy crawl dmoz -o items.json -t json
The problem is that when I change the code above to print the extracted info to the console, i.e. like this:
flat_item['square'] = sel.xpath('//div/strong[@class="param"][1]/text()').extract()
for bla in flat_item['square']:
    print bla
the script properly displays the information in Russian.
But when I dump the scraped information using the first version of the script (with encoding to utf-8), it writes something like this to the JSON file:
[{"square": ["2-\u043a\u043e\u043c\u043d., 16 \u044d\u0442\u0430\u0436 16-\u044d\u0442. \u0434\u043e\u043c", "1-\u043a\u043e\u043c\u043d.,
How can I dump the information into the JSON file in Russian? Thank you for your advice.
[1]: http://rieltor.ua/flats-sale/?ncrnd=6510
It is correctly encoded; it's just that the json library escapes non-ASCII characters by default.
You can load the data and use it (copying data from your example):
>>> import json
>>> print json.loads('"2-\u043a\u043e\u043c\u043d., 16 \u044d\u0442\u0430\u0436 16-\u044d\u0442. \u0434\u043e\u043c"')
2-комн., 16 этаж 16-эт. дом
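If you want the file itself to contain readable Cyrillic rather than \uXXXX escapes, you can serialize in your own export step with ensure_ascii=False (a sketch; items stands for your list of scraped dicts):
import codecs
import json

# json.dumps returns a unicode object when ensure_ascii=False meets
# non-ASCII input, so write it through a UTF-8 encoded file.
with codecs.open('items.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(items, ensure_ascii=False))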

django.http.HttpResponse does not handle unicode properly

Following Tutorial 3, I have written this trivial views.py:
# coding = UTF-8
from django.http import HttpResponse

def index(request):
    return HttpResponse( u"Seznam kontaktů" )
I also tried other tricks, such as using django.utils.encoding.smart_unicode(...), the u"%s" % ... trick, etc.
Whatever I try, I always get the "Non-ASCII character" error:
SyntaxError at /kontakty/
Non-ASCII character '\xc5' in file C:\Users\JindrichVavruska\eclipse\workspace\ars\src\ars_site\party\views.py
on line 5, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details (views.py, line 5)
It is even more mysterious because I used a lot of national character strings in other files, such as models.py, e.g. text = models.CharField( u"Všechen text", max_length = 150), and there was absolutely no problem at all.
I found other answers on this site irrelevant; the suggested changes make no difference in my views.py.
Jindra
It should be # -*- coding: utf-8 -*-, with the colon directly after coding. The form # coding = UTF-8, with a space before the =, does not match the pattern PEP 263 defines, so Python ignores the declaration and treats the file as ASCII. See PEP 263 for details. You should also save the file as UTF-8; check your editor's settings.
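For reference, a minimal views.py that compiles cleanly once the declaration matches PEP 263 and the file is saved as UTF-8:
# -*- coding: utf-8 -*-
from django.http import HttpResponse

def index(request):
    # the unicode literal works because the declared and actual encodings agree
    return HttpResponse(u"Seznam kontaktů")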