encoding errors when replacing unwanted characters in utf-8 encoded file - python-2.7

Code can be downloaded here:
https://github.com/kelrien/pyretrieval/
Whenever I execute my example.py, the following error pops up:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "example.py", line 21, in <module>
docs.append(proc.process(line.decode("utf-8")))
File "pyretrieval\processor.py", line 61, in process
tokens = self.tokenize(string)
File "pyretrieval\processor.py", line 47, in tokenize
temp = temp.replace(char, self.replace_characters[char])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
As you can see, the error happens when trying to replace the German umlauts I specified. If I don't use the replace_characters dict and just ignore those umlauts, I don't get the error.
I already tried a lot of things:
Using the codecs module
Using encode("utf-8") and decode("utf-8") at different places in the code

I found a solution: I had to define the characters I wanted to replace as unicode strings as well (in processor.py).
I already pushed the necessary changes to github. https://github.com/kelrien/pyretrieval
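The root cause is a classic Python 2.7 pitfall: calling replace() on a unicode string with byte-string arguments makes Python decode those arguments with the default ASCII codec, which fails on the byte 0xfc ('ü'). A minimal sketch of the fix, with an illustrative replace_characters dict (not the exact one from the repo):

# Both keys and values are unicode literals, so replace() never has
# to decode byte strings implicitly (contents are illustrative):
replace_characters = {
    u'\xe4': u'ae',  # ä
    u'\xf6': u'oe',  # ö
    u'\xfc': u'ue',  # ü
}

def tokenize(string):
    # `string` is already unicode here, e.g. line.decode("utf-8")
    temp = string
    for char in replace_characters:
        temp = temp.replace(char, replace_characters[char])
    return temp.split()

print(tokenize(u'f\xfcr'))  # prints [u'fuer']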

Related

UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)

Help me figure out what's wrong here. I am running text summarization using Transformers:
~/Bart_T5-summarization$ python app.py
No handlers could be found for logger "transformers.data.metrics"
Traceback (most recent call last):
File "app.py", line 6, in
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/init.py", line 42, in
from .tokenization_auto import AutoTokenizer
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_auto.py", line 28, in
from .tokenization_xlm import XLMTokenizer
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_xlm.py", line 27, in
import sacremoses as sm
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/init.py", line 2, in
from sacremoses.tokenize import *
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 16, in
class MosesTokenizer(object):
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 41, in MosesTokenizer
PAD_NOT_ISALNUM = r"([^{}\s.'`\,-])".format(IsAlnum), r" \1 "
UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)
Running the command with python3 instead of python solved this issue for me. I was able to run the code and obtain a summarization.
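The last frame shows a Python 2-only gotcha: calling .format on a byte string with a unicode argument forces an implicit ASCII encode of that argument. A minimal reproduction of the pattern (the IsAlnum value below is an illustrative stand-in for the much larger sacremoses character class):

# In Python 2, str.format ASCII-encodes unicode arguments, so any
# non-ASCII character in IsAlnum raises the UnicodeEncodeError above:
IsAlnum = u'a-zA-Z0-9\xe9\u4e00'  # stand-in; the real class is huge
PAD_NOT_ISALNUM = r"([^{}\s.'`\,-])".format(IsAlnum), r" \1 "
# UnicodeEncodeError: 'ascii' codec can't encode characters ...
# In Python 3 every string literal is unicode, so the same line runs
# fine, which is why switching from python to python3 fixes it.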

Trying to run Word Embeddings Benchmarks and getting UnicodeDecodeError

I am trying to run the Word Embeddings Benchmarks from this Github: Word Embeddings Benchmarks Github on word2vec embeddings I've created. I've included a picture of what my embedding file looks like.
I keep getting this error:
Traceback (most recent call last):
File "./evaluate_on_all.py", line 75, in <module>
load_kwargs=load_kwargs)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embeddings.py", line 39, in load_embedding
w = Embedding.from_word2vec(fname, binary=False)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embedding.py", line 482, in from_word2vec
words, vectors = Embedding._from_word2vec_text(fname)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embedding.py", line 340, in _from_word2vec_text
header = fin.readline()
File "/share/software/user/open/python/3.6.1/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 16: invalid start byte
I just want to be able to get the benchmarks to work properly with my embeddings.
Results of hexdump of header:
It looks like you're getting the error reading the very-first header line of the file (which suggests it's not something like a challenging word later on):
https://github.com/kudkudak/word-embeddings-benchmarks/blob/2b56c401ea4bba335ebfc0c8c5c4f8ba6394f2cd/web/embedding.py#L340
Are you sure that you're specifying the right plain-text file?
Might the file have extra hidden characters at the beginning, like the 'Byte Order Mark'? (Looking at hexdump -C YOUR_FILE_NAME | head could give a clue.)
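One quick way to check both points from Python itself (the filename below is a placeholder for your embeddings file):

# Peek at the first bytes of the file (filename is a placeholder):
with open('my_embeddings.txt', 'rb') as f:
    head = f.read(32)
print(repr(head))
# A plain-text word2vec file starts with an ASCII header such as
# '71291 200\n'. A UTF-8 BOM shows up as '\xef\xbb\xbf'; a byte like
# '\x80' near the start usually means the file is in the binary
# word2vec format, which would need binary=True when loading.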

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 3: ordinal not in range(128)

I encounter an error running a script I wrote on an Ubuntu box with Python 2.7.
It seems there is some issue with unicode, yet I cannot figure it out.
I tried to encode the variables with UTF-8, yet I'm not even sure which one causes the error: "str(count)" or "tag[u'Value']".
Traceback (most recent call last):
File "./AWS_collection_unix.py", line 105, in <module>
main()
File "./AWS_collection_unix.py", line 91, in main
ec2_tags_per_region(region, text_file)
File "./AWS_collection_unix.py", line 65, in ec2_tags_per_region
print_ids_and_tags(instance, text_file)
File "./AWS_collection_unix.py", line 16, in print_ids_and_tags
text_file.write('%s. %s' % (str(count), tag[u'Value']))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 3: ordinal not in range(128)
The error doesn't specify which parameter is the problem:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 3: ordinal not in range(128)
How can I make this work appropriately?
Thanks
The shortest solution is to wrap count with unicode(count).encode('utf-8'), which converts count to a UTF-8 encoded str.
But the best way is to understand what the encoding of the count variable actually is.
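For completeness, a minimal sketch of the failing pattern and two common fixes (the tag value below is made up to reproduce u'\xbf'):

import io

count = 1
value = u'\xbfHola?'  # illustrative tag value containing u'\xbf'

# Failing pattern: because `value` is unicode, '%' promotes the whole
# result to unicode, and a plain file object then tries to encode it
# as ASCII inside write():
# text_file.write('%s. %s' % (str(count), value))  # UnicodeEncodeError

# Fix 1: encode explicitly before writing to a byte-oriented file.
with open('tags.txt', 'w') as text_file:
    text_file.write((u'%s. %s' % (count, value)).encode('utf-8'))

# Fix 2: open the file with an explicit encoding and write unicode.
with io.open('tags.txt', 'w', encoding='utf-8') as text_file:
    text_file.write(u'%s. %s' % (count, value))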

UnicodeDecodeError: in python 2.7

I am working with the VirusTotal api, exactly with this:
https://www.virustotal.com/es/documentation/public-api/#scanning-files
This is the part of my script where I'm having problems:
def scanAFile(fileToScan):
    host = "www.virustotal.com"
    selector = "https://www.virustotal.com/vtapi/v2/file/scan"
    fields = [("apikey", myPublicKey)]
    file_to_send = open(fileToScan, "rb").read()
    files = [("file", fileToScan, file_to_send)]
    json = postfile.post_multipart(host, selector, fields, files)
    return simplejson.loads(json)
With some files it runs fine without any error, but the following error occurs when scanning others; this example is for a JPG file:
Traceback (most recent call last):
File "F:/devPy/myProjects/script_vt.py", line 138, in <module>
scanMyFile()
File "F:/devPy/myProjects/script_vt.py", line 75, in scanQueue
jsonScan = scanAFile(fileToScan)
File "F:/devPy/myProjects/script_vt.py", line 37, in scanAFile
json = postfile.post_multipart(host, selector, fields, files)
File "F:\devPy\myProjects\script_vt.py", line 10, in post_multipart
content_type, body = encode_multipart_formdata(fields, files)
File "F:\devPy\myProjects\script_vt.py", line 42, in encode_multipart_formdata
body = CRLF.join(L)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
I should mention that I work with PyCharm under Windows; could this cause the encoding error?
Any idea how to solve it? I got stuck and couldn't find any solution on the net.
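The traceback narrows the problem down to CRLF.join(L): if CRLF (or any form field) ends up unicode while the file body is raw bytes, Python 2 tries to ASCII-decode the bytes, and 0xff is the first byte of every JPEG. A minimal reproduction under that assumption:

# Joining raw JPEG bytes with a unicode separator makes Python 2
# implicitly decode the bytes as ASCII, which fails on 0xff:
CRLF = u'\r\n'                       # unicode separator
L = ['--boundary', '\xff\xd8\xff']   # illustrative multipart parts
# body = CRLF.join(L)                # UnicodeDecodeError: 'ascii' codec...

# Keeping every part a byte string avoids the implicit decode:
CRLF = b'\r\n'
body = CRLF.join(L)                  # works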

Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi

I'm stuck here trying to unescape HTML special characters.
The problematic text is
Rudimental &amp; Emeli Sandé
which should be converted to
Rudimental & Emeli Sandé
The text is downloaded via wget (outside of Python).
To test this, save an ANSI file with this line and read it in:
import HTMLParser
trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()
track = html_parser.unescape(track)
print(track)
I get this error when a line has é in it.
pi@raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
File "unparse.py", line 9, in <module>
track = html_parser.unescape(track)
File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
The same code works fine under windows - I only have problems on the raspberry pi
running Python 2.7.3.
Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.
Your problem (condensed):
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)
produces
Traceback (most recent call last):
File "problem.py", line 4, in <module>
output = parser.unescape(input)
File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
HTMLParser.unescape() returns a unicode object, and therefore has to convert your input str. So it asks for the default encoding (which in your case is ASCII) and fails to interpret '\xe9' as an ASCII character (because it isn't). I guess your file encoding is ISO-8859-1 where '\xe9' is 'é'.
There are two easy solutions. Either you do the conversion manually:
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
input = input.decode('iso-8859-1')
output = parser.unescape(input)
or you use codecs.open() instead of open() whenever you are working with files:
import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)
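With either variant, output is the unicode string u'Rudimental & Emeli Sand\xe9' (plus the trailing newline in the file-reading case), which prints as Rudimental & Emeli Sandé on a terminal whose encoding can represent it.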