Python 2.7 - finding UTF-8 characters

from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
"\xe2\x80\x9c" is the UTF-8 encoding of the left curly double quote (U+201C). When I try to replace curly quotes in a web page using this code, I get this error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2265: ordinal not in range(128)
What does this error mean, what am I doing wrong, and how do I fix it?

You have to decode the downloaded page with decode('utf-8') before replacing, so that both web and the search string are unicode objects.
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read().decode('utf-8')
web = web.replace(b"\xe2\x80\x9c".decode('utf8'), '"')
print(web)
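As a follow-up to the decode-first fix above, here is a minimal offline sketch that straightens the four common curly quotes in one pass. The sample bytes and the names `raw_page` and `QUOTE_MAP` are made up for illustration and do not come from the question; the code is written so that it should run on both Python 2.7 and Python 3.

```python
# Hypothetical stand-in for urlopen(...).read(): UTF-8 bytes with curly quotes.
raw_page = b"\xe2\x80\x9cHello\xe2\x80\x9d, it\xe2\x80\x99s me"

# Decode once, right after reading, so everything below is unicode text.
page = raw_page.decode("utf-8")

# Map the code points of curly quotes to their straight equivalents.
QUOTE_MAP = {
    0x201C: u'"',  # left double quotation mark
    0x201D: u'"',  # right double quotation mark
    0x2018: u"'",  # left single quotation mark
    0x2019: u"'",  # right single quotation mark
}

# translate() on a unicode string accepts a dict keyed by code point.
straight = page.translate(QUOTE_MAP)
print(repr(straight))
```

Because the page is decoded before any replacement happens, no implicit ascii conversion is involved at any step.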

This is due to Python 2 using the "ascii" codec by default, both for source files and for implicit conversions between str and unicode. In Python 3 string literals are unicode, the default source encoding is UTF-8, and you can put non-ASCII characters directly in your code. You can do that now in Python 2 with a __future__ import, as long as the file also declares its source encoding.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.decode("utf-8")
web = web.replace('“', '"')
print(repr(web))

Note that this is a Python 2 solution. Python 3 handles strings and bytes differently.
I can reproduce the problem with
>>> web = "0123\xe2\x80\x9c789"
>>> web.replace("\xe2\x80\x9c".decode('utf-8'), '"')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
You read an encoded string into web; I just made a simpler one for testing. When you decoded the search string, you created a unicode object. For the replacement to work, Python first has to convert web to unicode as well, and it does that implicitly with the ascii codec.
>>> "\xe2\x80\x9c".decode('utf-8')
u'\u201c'
>>> unicode(web)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
It was the implicit conversion of web that got you. In Python 2, str can hold encoded bytes, and that's exactly what you have here. One option is to just replace the encoded byte sequence:
>>> web.replace("\xe2\x80\x9c", '"')
'0123"789'
This only works because you knew the page was encoded with utf-8. That is usually the case, but worth the mention.
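To illustrate that caveat, the sketch below encodes the same text two ways and applies the byte-level replacement to both. The page contents and variable names are hypothetical; cp1252 is used only as an example of a non-UTF-8 encoding a page might have. It should run on both Python 2.7 and Python 3.

```python
LEFT_QUOTE = u"\u201c"
UTF8_SEQUENCE = b"\xe2\x80\x9c"  # LEFT_QUOTE encoded as UTF-8

# The same text encoded two different ways (hypothetical page contents).
utf8_page = (u"0123" + LEFT_QUOTE + u"789").encode("utf-8")
cp1252_page = (u"0123" + LEFT_QUOTE + u"789").encode("cp1252")

# Byte-level replacement works when the page really is UTF-8 ...
fixed_utf8 = utf8_page.replace(UTF8_SEQUENCE, b'"')

# ... but finds nothing to replace in the cp1252 bytes, and raises no error.
fixed_cp1252 = cp1252_page.replace(UTF8_SEQUENCE, b'"')

print(repr(fixed_utf8))
print(repr(fixed_cp1252))
```

The silent no-op in the second case is the reason decoding with the page's actual encoding is usually the safer route.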

Related

Is it possible to avoid old style '%' string formatting for this python string

I'm trying to remove all traces of old style string formatting in our python (2.7) code. However I've hit an example where only the old-style seems to work.
>>> x = u'\xa3'
>>> y = '{}'.format(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
And here's the code using old style which works ok
>>> y = '%s' % x
Is there a way of making this work using some form of { } syntax?
You can use a unicode literal for the format string instead:
y = u'{}'.format(x)
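A short sketch of the difference, under the assumption that the goal is simply to keep the whole operation in unicode: in Python 2, '%s' % x quietly promotes the result to unicode when x is unicode, whereas '{}'.format(x) on a byte-string template tries to encode x with the ascii codec. Making the template unicode avoids that. The `price` line is an extra illustration that is not in the question; the code should run on both Python 2.7 and Python 3.

```python
x = u'\xa3'  # POUND SIGN, as in the question

# A unicode template keeps the whole operation in unicode.
new_style = u'{}'.format(x)
old_style = u'%s' % x

# Format specs work the same way with a unicode template.
price = u'Price: {0}{1:.2f}'.format(x, 9.5)

print(repr(new_style))
print(new_style == old_style)
print(repr(price))
```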

Jython 2.7.1 + ftfy 4.4

What can be wrong with this import?
I downloaded version 4.4 for Jython 2.7
import ftfy
import sys
print (ftfy.fix_encoding("н368вв777"))
Traceback (most recent call last):
File "D:/rs_al/IdeaProjects/XLStoSQL/src/main/java/BrokenUTF8.py", line 4, in <module>
import ftfy
File "C:\jython\Lib\site-packages\ftfy\__init__.py", line 12, in <module>
from ftfy import fixes
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 2-8: illegal Unicode character
With Python 3 + ftfy 5 everything works, but I was hoping to use Java + Jython to fix broken UTF-8 text with the ftfy package and return the data to Java.
I also set the default source encoding to UTF-8, because Jython 2.7 decodes source files as ASCII by default.
ftfy only works at full power on Python 3. I moved the project to Python, which solved the problem.

Error encoding/decoding already Unicode object python

I'm using Python 2.7.
I have a unicode string like this:
s = u'Rub\xc3\xa9n'
I would like to print this:
print convert(s)
Rubén
I tried printing it directly in several ways, but without success:
print s
RubÃ©n
print s.encode('utf-8')
RubÃ©n
print s.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-4: ordinal not in range(128)
I know the way I declared the string is not ideal, but other scripts produce it in that format.
Thank you very much for your help.
That is a Unicode string that was mis-decoded as latin1 or a similar encoding such as windows-1252, but was really utf8:
>>> s = 'Rub\xc3\xa9n'.decode('latin1')
>>> s
u'Rub\xc3\xa9n'
It should have been decoded as:
>>> s = 'Rub\xc3\xa9n'.decode('utf8')
>>> s
u'Rub\xe9n'
>>> print s
Rubén
If you don't have control of how the string was generated, you can undo the problem with:
>>> print u'Rub\xc3\xa9n'.encode('latin1').decode('utf8')
Rubén
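The round trip above can be wrapped in a small helper that leaves the text alone when the repair does not apply. This is a minimal sketch, not a general mojibake fixer; `fix_latin1_mojibake` is a hypothetical name, and the code should run on both Python 2.7 and Python 3.

```python
def fix_latin1_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as latin1.

    Hypothetical helper: returns the input unchanged if the round trip fails.
    """
    try:
        return text.encode('latin1').decode('utf8')
    except UnicodeError:
        # Either the text has characters outside latin1, or the resulting
        # bytes are not valid UTF-8, so it was probably not mojibake.
        return text

print(repr(fix_latin1_mojibake(u'Rub\xc3\xa9n')))  # mis-decoded input
print(repr(fix_latin1_mojibake(u'Rub\xe9n')))      # already correct input
```

One caveat: a correctly decoded string whose latin1 bytes happen to form valid UTF-8 would also be rewritten, so the helper is best applied only to data known to come from the faulty source.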

Prettify() error using python 2.7

Code:
import urllib2
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.prettify())
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 7, in <module>
print(soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 8775: ordinal not in range(128)
[Finished in 2.4s with exit code 1]
I can't figure out this error. I am using Python 2.7.9.
If your console's encoding is ASCII, then print has to convert the unicode text to ASCII, and any character outside the ASCII range raises an exception.
If the console accepts Unicode, everything is displayed correctly. Try this command and then run the program again:
export LANG=en_US.UTF-8
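If changing the environment is not an option, another common approach (not part of the answer above) is to encode the text explicitly before writing it, replacing anything the target encoding cannot represent. The sketch below assumes that approach; `encode_for_console` is a hypothetical helper, and the code should run on both Python 2.7 and Python 3.

```python
import sys

def encode_for_console(text, encoding=None):
    """Encode unicode text for output, replacing unencodable characters.

    Hypothetical helper: falls back to UTF-8 when the stream reports no encoding.
    """
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, 'replace')

sample = u'caf\xe8'  # contains u'\xe8', the character from the traceback
print(repr(encode_for_console(sample, 'ascii')))
print(repr(encode_for_console(sample, 'utf-8')))
```

In the Python 2 script from the question, this would presumably be used as print(encode_for_console(soup.prettify())), trading an exception for a '?' wherever the console cannot show a character.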

Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi

I'm stuck here trying to unescape HTML special characters.
The problematic text is
Rudimental &amp; Emeli Sandé
which should be converted to
Rudimental & Emeli Sandé
The text is downloaded via wget (outside of Python).
To test this, save an ANSI file with this line and import it.
import HTMLParser
trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()
track = html_parser.unescape(track)
print(track)
I get this error when a line has é in it.
pi#raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
File "unparse.py", line 9, in <module>
track = html_parser.unescape(track)
File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
The same code works fine under Windows. I only have problems on the Raspberry Pi running Python 2.7.3.
Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.
Your problem (condensed):
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)
produces
Traceback (most recent call last):
File "problem.py", line 4, in <module>
output = parser.unescape(input)
File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
HTMLParser.unescape() returns a unicode object, and therefore has to convert your input str. So it asks for the default encoding (which in your case is ASCII) and fails to interpret '\xe9' as an ASCII character (because it isn't). I guess your file encoding is ISO-8859-1 where '\xe9' is 'é'.
There are two easy solutions. Either you do the conversion manually:
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
input = input.decode('iso-8859-1')
output = parser.unescape(input)
or you use codecs.open() instead of open() whenever you are working with files:
import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)
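For completeness, here is a self-contained sketch of the file-based variant that creates its own test file, so it can be run without import.txt. It uses io.open, which accepts an encoding argument much like codecs.open, and it falls back between the Python 3 and Python 2 locations of unescape. The temporary file, the sample line with an escaped ampersand, and the variable names are assumptions made for illustration.

```python
import io
import os
import tempfile

try:
    from html import unescape            # Python 3.4+
except ImportError:
    import HTMLParser                    # Python 2
    unescape = HTMLParser.HTMLParser().unescape

# Write a hypothetical ISO-8859-1 ("ANSI") file like the one in the question.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as handle:
    handle.write(b'Rudimental &amp; Emeli Sand\xe9\n')

try:
    # io.open decodes while reading, so the line is already unicode
    # by the time it reaches unescape().
    with io.open(path, encoding='iso-8859-1') as handle:
        track = unescape(handle.readline().rstrip(u'\n'))
finally:
    os.remove(path)

print(repr(track))
```

Decoding at the point of reading means unescape never has to guess an encoding, which is what triggered the ascii error in the first place.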