Prettify() error using python 2.7 - python-2.7

Code:
import urllib2
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.prettify())
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 7, in <module>
print(soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 8775: ordinal not in range(128)
[Finished in 2.4s with exit code 1]
I can't seem to understand the error. I am using Python 2.7.9.

If your console uses ASCII, print converts the unicode text to ASCII, and any character outside the ASCII range raises this exception.
If the console accepts Unicode, everything is displayed correctly. Try this command and run the program again:
export LANG=en_US.UTF-8
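If changing the console locale is not an option, another approach is to encode the text yourself before writing it, so that unencodable characters are replaced instead of raising. This is only a minimal sketch: the helper name encode_for_console is hypothetical, and it assumes that falling back to ASCII with "?" placeholders is acceptable when the stream reports no encoding.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode unicode text for a console, replacing unencodable characters.

    encode_for_console is a hypothetical helper name, not part of bs4.
    """
    # Use the stream's own encoding when it reports one; otherwise assume ASCII.
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # errors="replace" substitutes "?" instead of raising UnicodeEncodeError.
    return text.encode(target, "replace")


if __name__ == "__main__":
    # u"\xe8" is the character named in the traceback above.
    print(repr(encode_for_console(u"Gen\xe8ve", "ascii")))
    print(repr(encode_for_console(u"Gen\xe8ve", "utf-8")))
```

In the script from the question, the value returned by soup.prettify() would be passed as text. Under Python 2 the returned byte string can be printed directly.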

Related

UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)

Please help me figure out what's wrong here. I am running text summarization using Transformers:
~/Bart_T5-summarization$ python app.py
No handlers could be found for logger "transformers.data.metrics"
Traceback (most recent call last):
File "app.py", line 6, in <module>
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/__init__.py", line 42, in <module>
from .tokenization_auto import AutoTokenizer
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_auto.py", line 28, in <module>
from .tokenization_xlm import XLMTokenizer
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_xlm.py", line 27, in <module>
import sacremoses as sm
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/__init__.py", line 2, in <module>
from sacremoses.tokenize import *
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 16, in <module>
class MosesTokenizer(object):
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 41, in MosesTokenizer
PAD_NOT_ISALNUM = r"([^{}\s.'`\,-])".format(IsAlnum), r" \1 "
UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)
Running the command with python3 instead of python solved this issue for me. I was able to run the code and obtain a summarization.

Python 2.7 import unicode_literals from __future__ gives UnicodeDecodeError while reading a file with umlauts

I have a Python script which reads and writes a file with German umlauts (äöü) in an input file "myfile.in". I use Python version 2.7. Here is a reduced version of my script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
if __name__=='__main__':
    with open("myfile.in", "r") as f:
        lines = f.readlines()
    txt = ""
    for line in lines:
        txt = txt + line
    with open("myfile.out", "w") as f:
        f.write(txt)
This works fine.
Now my customer requires me to use future statement definitions, so I added the following line to my Python script:
from __future__ import unicode_literals
Now I get the following error message:
Traceback (most recent call last):
File "myscript.py", line 9, in <module>
txt = txt + line
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 23: ordinal not in range(128)
How can I resolve this problem?
Thanks for your hints Thomas
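A likely explanation: with unicode_literals, txt = "" becomes a unicode object, while each line read by open() is still a byte string, so txt + line makes Python 2 decode the bytes implicitly with the ascii codec. One way around this is to let io.open decode and encode explicitly. The sketch below is an assumption-laden example, not a confirmed fix: the function name copy_text is hypothetical, and the default encoding is a guess based on the byte 0xe4 in the traceback, which is how "ä" is stored in Latin-1 (in UTF-8 it would be the two bytes 0xc3 0xa4).

```python
import io


def copy_text(src_path, dst_path, encoding="latin-1"):
    """Read a text file as unicode and write it back with the same encoding.

    copy_text is a hypothetical helper; the latin-1 default is an assumption,
    pass encoding="utf-8" if the file is really UTF-8.
    """
    # io.open decodes while reading, so every line is already unicode.
    with io.open(src_path, "r", encoding=encoding) as f:
        lines = f.readlines()
    txt = u""
    for line in lines:
        txt = txt + line  # unicode + unicode, no implicit ascii decoding
    # io.open encodes while writing.
    with io.open(dst_path, "w", encoding=encoding) as f:
        f.write(txt)
    return txt
```

Calling copy_text("myfile.in", "myfile.out") would then replace the body of the original script.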

Jython 2.7.1 + ftfy 4.4

What can be wrong with this import?
I downloaded version 4.4 for Jython 2.7
import ftfy
import sys
print (ftfy.fix_encoding("н368вв777"))
Traceback (most recent call last):
File "D:/rs_al/IdeaProjects/XLStoSQL/src/main/java/BrokenUTF8.py", line 4, in <module>
import ftfy
File "C:\jython\Lib\site-packages\ftfy\__init__.py", line 12, in <module>
from ftfy import fixes
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 2-8: illegal Unicode character
With Python 3 + ftfy 5 everything works, but I wanted to use Java + Jython to repair broken UTF-8 characters with the ftfy package and return the data to Java.
I also set the default source encoding to UTF-8, because Jython 2.7 decodes source files as ASCII by default.
ftfy is only fully functional on Python 3. I moved the project from Jython to Python, which solved the problem.

Python 2.7 - finding UTF-8 characters

from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
"\xe2\x80\x9c" is the UTF-8 encoding of the left curly quote. When I try to find curly quotes in a website using this code, I get this error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2265: ordinal not in range(128)
What does this error mean, what am I doing wrong, and how do I fix it?
You have to decode the downloaded page itself with decode('utf-8'), so that both the page and the search string are unicode.
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read().decode('utf-8')
web = web.replace(b"\xe2\x80\x9c".decode('utf8'), '"')
print(web)
This is because Python 2 uses the "ascii" codec by default whenever it implicitly converts between str and unicode. In Python 3, string literals are unicode and source files default to UTF-8, so you can write non-ASCII characters directly in your code. You can get unicode literals in Python 2 now by using a future import; the source file then also needs a UTF-8 coding declaration.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.decode("utf-8")
web = web.replace('“' , '"')
print(repr(web))
Note that this is a python 2 solution. Python 3 handles strings and bytes differently.
I can reproduce the problem with
>>> web = "0123\xe2\x80\x9c789"
>>> web.replace("\xe2\x80\x9c".decode('utf-8'), '"')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
You read an encoded byte string into web; I just made a simpler one for testing. When you decoded the search string, you created a unicode object. For the replacement to work, web must also be converted to unicode, and Python 2 attempts that conversion implicitly with the ascii codec:
>>> "\xe2\x80\x9c".decode('utf-8')
u'\u201c'
>>> unicode(web)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
It was the web conversion that got you. In python 2, str can hold encoded bytes - and that's exactly what you have here. One option is to just replace the encoded byte sequence
>>> web.replace("\xe2\x80\x9c", '"')
'0123"789'
This only works because you know the page is encoded as UTF-8. That is usually the case, but it is worth mentioning.
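The decode-first approach from the answers above can be sketched as a small function. This is an illustration under stated assumptions: the name straighten_double_quotes is hypothetical, and the page is assumed to be UTF-8 unless another encoding is passed in.

```python
def straighten_double_quotes(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes, then replace curly double quotes.

    The function name and the default encoding are assumptions for this sketch.
    """
    # Convert bytes to unicode explicitly, so no implicit ascii decoding occurs.
    text = raw_bytes.decode(encoding)
    # U+201C and U+201D are the left and right curly double quotes.
    for curly in (u"\u201c", u"\u201d"):
        text = text.replace(curly, u'"')
    return text
```

With the test string from above, straighten_double_quotes(b"0123\xe2\x80\x9c789") returns the unicode string 0123"789.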

Error encoding/decoding already Unicode object python

I'm using Python 2.7.
I have a unicode string like this:
s = u'Rub\xc3\xa9n'
I would like to print this:
print convert(s)
Rubén
I tried printing it directly in several ways, without success:
print s
Rubén
print s.encode('utf-8')
Rubén
print s.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-4: ordinal not in range(128)
I know the way the string is declared is not ideal, but that is the format other scripts give me.
Thank you very much for help.
That is a Unicode string that was mis-decoded as latin1 or a similar encoding such as windows-1252, but was really utf8:
>>> s = 'Rub\xc3\xa9n'.decode('latin1')
>>> s
u'Rub\xc3\xa9n'
It should have been decoded as:
>>> s = 'Rub\xc3\xa9n'.decode('utf8')
>>> s
u'Rub\xe9n'
>>> print s
Rubén
If you don't have control of how the string was generated, you can undo the problem with:
>>> print u'Rub\xc3\xa9n'.encode('latin1').decode('utf8')
Rubén