Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi - python-2.7

I'm stuck here trying to unescape HTML special characters.
The problematic text is
Rudimental &amp; Emeli Sandé
which should be converted to
Rudimental & Emeli Sandé
The text is downloaded via wget (outside of Python).
To test this, save an ANSI file with this line and read it in:
import HTMLParser
trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()
track = html_parser.unescape(track)
print(track)
I get this error when a line has é in it.
pi@raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
  File "unparse.py", line 9, in <module>
    track = html_parser.unescape(track)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
The same code works fine under Windows; I only have problems on the Raspberry Pi running Python 2.7.3.

Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.
Your problem (condensed):
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)
produces
Traceback (most recent call last):
  File "problem.py", line 4, in <module>
    output = parser.unescape(input)
  File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
HTMLParser.unescape() returns a unicode object, and therefore has to convert your input str. So it asks for the default encoding (which in your case is ASCII) and fails to interpret '\xe9' as an ASCII character (because it isn't). I guess your file encoding is ISO-8859-1 where '\xe9' is 'é'.
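You can reproduce the implicit conversion in isolation. On a stock Python 2 install the default encoding is ASCII, so combining a non-ASCII str with unicode fails the same way:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'' + '\xe9'  # the same implicit str-to-unicode conversion unescape() triggers
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)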
There are two easy solutions. Either you do the conversion manually:
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
input = input.decode('iso-8859-1')
output = parser.unescape(input)
or you use codecs.open() instead of open() whenever you are working with files:
import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)
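Applied to the original test script, the second fix might look like this (a sketch, assuming the file really is ISO-8859-1):
import codecs
import HTMLParser

html_parser = HTMLParser.HTMLParser()
# codecs.open decodes each line to unicode up front, so unescape() never
# falls back to the ASCII default encoding.
with codecs.open('import.txt', 'r', encoding='iso-8859-1') as f:
    trackentry = f.readlines()
track = html_parser.unescape(trackentry[0])
print(track.encode('utf-8'))  # encode explicitly if your terminal is not UTF-8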

Related

UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)

Help me figure out what's wrong with this. I am running text summarization using Transformers:
~/Bart_T5-summarization$ python app.py
No handlers could be found for logger "transformers.data.metrics"
Traceback (most recent call last):
  File "app.py", line 6, in <module>
    from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
  File "/home/darshan/.local/lib/python2.7/site-packages/transformers/__init__.py", line 42, in <module>
    from .tokenization_auto import AutoTokenizer
  File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_auto.py", line 28, in <module>
    from .tokenization_xlm import XLMTokenizer
  File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_xlm.py", line 27, in <module>
    import sacremoses as sm
  File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/__init__.py", line 2, in <module>
    from sacremoses.tokenize import *
  File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 16, in <module>
    class MosesTokenizer(object):
  File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 41, in MosesTokenizer
    PAD_NOT_ISALNUM = r"([^{}\s.'`\,-])".format(IsAlnum), r" \1 "
UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)
Running the command with python3 instead of python solved this issue for me. I was able to run the code and obtain a summary.

apache beam 2.7.0 crashes when utf-8 decoding French characters

I am trying to write a CSV file from a Google Cloud Platform bucket into Datastore. It contains French characters/accents, but I get an error message regarding decoding.
After trying to encode and decode from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried to change things manually...
The OS I am using has "ascii" as the default encoding, so I manually changed it to "utf-8" in "Anaconda3/envs/py27/lib/site.py":
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8"  # Default value set by _PyUnicode_Init()
    sys.setdefaultencoding("utf-8")
I've tried locally with a test file, printing and then writing a string with accents into a file, and it worked!
string = 'naïve café'
test_decode = codecs.utf_8_decode(string, "strict", True)[0]
print(test_decode)
with open('./test.txt', 'w') as outfile:
    outfile.write(test_decode)
But no luck with apache_beam...
Then I tried to manually change "/usr/lib/python2.7/encodings/utf_8.py", putting "ignore" instead of "strict" into codecs.utf_8_decode:
def decode(input, errors='ignore'):
    return codecs.utf_8_decode(input, errors, True)
but I've realized that apache_beam does not use this file, or at least does not take any changes to it into account.
Any ideas how to deal with it?
Please find the error message below:
Traceback (most recent call last):
  File "etablissementsFiness.py", line 146, in <module>
    dataflow(run_locally)
  File "etablissementsFiness.py", line 140, in dataflow
    | 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\pipeline.py", line 414, in __exit__
    self.run().wait_until_finish()
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1148, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 201, in read_records
    yield self._coder.decode(record)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/coders/coders.py", line 307, in decode
    return value.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 190: invalid continuation byte
Try to write a CustomCoder class and "ignore" any errors while decoding:
from apache_beam.coders.coders import Coder

class CustomCoder(Coder):
    """A custom coder used for reading and writing strings as UTF-8."""

    def encode(self, value):
        return value.encode("utf-8", "replace")

    def decode(self, value):
        return value.decode("utf-8", "ignore")

    def is_deterministic(self):
        return True
Then, read and write the files using the coder=CustomCoder():
lines = p | "Read" >> ReadFromText("files/path/*.txt", coder=CustomCoder())
# More processing code here...
output | WriteToText("output/file/path", file_name_suffix=".txt", coder=CustomCoder())
This error: "UnicodeDecodeError: 'utf8' codec can't decode byte"
means that your CSV file still contains some bytes the decoder does not recognize as valid UTF-8.
The easiest solution is to convert and validate the CSV input file so that it contains no UTF-8 errors before submitting it to Datastore. A simple online UTF-8 validator can check this.
If you need to convert Latin-1 to UTF-8 in Python, you can do it like this:
string.decode('iso-8859-1').encode('utf8')
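For a whole file, that one-liner generalizes to a small re-encoding pass. A minimal sketch, assuming the input really is Latin-1 (the file names are placeholders):
import codecs

# Read the file as Latin-1 and rewrite it as UTF-8, so every byte the
# pipeline later sees is valid UTF-8.
with codecs.open('input_latin1.csv', 'r', encoding='iso-8859-1') as src, \
        codecs.open('output_utf8.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)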

Python 2.7 - finding UTF-8 characters

from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
"\xe2\x80\x9c" is the UTF-8 character for curly quotes. When I try to find curly quotes in a website using this code, I get this error:
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2265: ordinal not in range(128)
What does this error mean, what am I doing wrong, and how do I fix it?
You have to use decode('utf-8') to decode the string.
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read().decode('utf-8')
web = web.replace(b"\xe2\x80\x9c".decode('utf8'), '"')
print(web)
This is because the Python 2 interpreter uses the "ascii" codec as the default for string literals. In Python 3 the default is UTF-8, and you can put Unicode characters directly in your code. You can get that behavior now, in Python 2, with a future import.
from __future__ import unicode_literals
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.decode("utf-8")
web = web.replace('“' , '"')
print(repr(web))
Note that this is a Python 2 solution. Python 3 handles strings and bytes differently.
I can reproduce the problem with
>>> web = "0123\xe2\x80\x9c789"
>>> web.replace("\xe2\x80\x9c".decode('utf-8'), '"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
You read an encoded string into web; I just made a simpler one for the test. When you decoded the search string, you created a unicode object. For the replacement to work, web needs to be converted to unicode as well.
>>> "\xe2\x80\x9c".decode('utf-8')
u'\u201c'
>>> unicode(web)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
It was the web conversion that got you. In Python 2, a str can hold encoded bytes, and that's exactly what you have here. One option is to just replace the encoded byte sequence directly:
>>> web.replace("\xe2\x80\x9c", '"')
'0123"789'
This only works because you knew the page was encoded as UTF-8. That is usually the case, but it is worth checking.
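If you would rather not hard-code that assumption, the HTTP response usually declares its charset. A small Python 2 sketch of that approach, falling back to UTF-8 when no charset is declared:
from urllib import urlopen

# In Python 2 the response headers are a mimetools.Message, whose
# getparam() reads parameters of the Content-Type header.
response = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html")
charset = response.headers.getparam('charset') or 'utf-8'
web = response.read().decode(charset)
web = web.replace(u'\u201c', u'"')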

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position

When I try to extract some pattern from a tagged text in nltk, I get the error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128). At first I did not have this error; it appeared only after I installed some packages.
This is the code:
# -*- coding: utf-8 -*-
import codecs
import re
import sys
import nltk
from nltk.corpus import *

k = nltk.corpus.brown.tagged_words('myfile')
for (w1, t1), (w2, t2) in nltk.bigrams(k):
    if t1 == 'NN' and t2 == 'AJ':
        print w1, w2
This is the entire output of the code:
Traceback (most recent call last):
  File "/home/fathi/egfe.py", line 12, in <module>
    for (w1,t1), (w2,t2) in nltk.bigrams(k):
  File "/usr/local/lib/python2.7/dist-packages/nltk/util.py", line 442, in bigrams
    for item in ngrams(sequence, 2, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/nltk/util.py", line 419, in ngrams
    history.append(next(sequence))
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/tagged.py", line 241, in read_block
    for para_str in self._para_block_reader(stream):
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 564, in read_blankline_block
    line = stream.readline()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1095, in readline
    new_chars = self._read(readsize)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1322, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1352, in _incr_decode
    return self.decode(bytes, 'strict')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128)
The problem is that the nltk version is not compatible with the Python version, so it requires an older version of the nltk toolkit.
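If downgrading is not an option, an alternative worth trying (a sketch, not tested against this exact setup; the corpus root is a placeholder) is to construct the reader with an explicit encoding, so nltk never falls back to ASCII:
from nltk.corpus.reader import TaggedCorpusReader

# Read the custom tagged file with a declared encoding, assuming the file
# is actually UTF-8 on disk.
reader = TaggedCorpusReader('/path/to/corpus/root', ['myfile'], encoding='utf-8')
k = reader.tagged_words('myfile')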

encoding errors when replacing unwanted characters in utf-8 encoded file

Code can be downloaded here:
https://github.com/kelrien/pyretrieval/
Whenever I execute my example.py, the following error pops up:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "example.py", line 21, in <module>
    docs.append(proc.process(line.decode("utf-8")))
  File "pyretrieval\processor.py", line 61, in process
    tokens = self.tokenize(string)
  File "pyretrieval\processor.py", line 47, in tokenize
    temp = temp.replace(char, self.replace_characters[char])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
As you can see, the error happens when trying to replace the German umlauts I specified. If I don't use the replace_characters dict and just ignore those umlauts, I don't get the error.
I already tried a lot of stuff:
Using the codecs module
Using encode("utf-8") and decode("utf-8") at different places
I found a solution. I had to specify the characters I wanted to replace as unicode too (in processor.py).
I already pushed the necessary changes to GitHub: https://github.com/kelrien/pyretrieval
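A sketch of the kind of change described above (the dict contents here are hypothetical; only the names follow the question): both the keys and the replacement values must be unicode, so the replace() call never has to promote a byte string via the ASCII codec.
# -*- coding: utf-8 -*-

# Hypothetical excerpt of processor.py after the fix: unicode keys and values.
replace_characters = {
    u'\xe4': u'ae',  # ä
    u'\xf6': u'oe',  # ö
    u'\xfc': u'ue',  # ü
}

def tokenize(text):
    # text must already be unicode, e.g. line.decode("utf-8")
    for char, replacement in replace_characters.items():
        text = text.replace(char, replacement)
    return text.split()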