apache beam 2.7.0 crashes in utf-8 decoding french characters - python-2.7

I am trying to write a CSV file from a Google Cloud Platform bucket into Datastore. The file contains French characters/accents, but I get an error message about decoding.
After trying to encode and decode from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried to change things manually...
The OS I am using has "ascii" as the default encoding, so I manually changed it to utf-8 in "Anaconda3/envs/py27/lib/site.py":
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    sys.setdefaultencoding("utf-8")
I've tried locally with a test file, by printing and then writing a string with accents into a file, and it worked!
string = 'naïve café'
test_decode = codecs.utf_8_decode(string, "strict", True)[0]
print(test_decode)
with open('./test.txt', 'w') as outfile:
    outfile.write(test_decode)
But no luck with apache_beam...
Then I tried to manually change "/usr/lib/python2.7/encodings/utf_8.py" and put "ignore" instead of "strict" in codecs.utf_8_decode:
def decode(input, errors='ignore'):
    return codecs.utf_8_decode(input, errors, True)
but I realized that apache_beam does not use this file, or at least does not take any changes to it into account.
Any ideas how to deal with this?
Please find the error message below:
Traceback (most recent call last):
  File "etablissementsFiness.py", line 146, in <module>
    dataflow(run_locally)
  File "etablissementsFiness.py", line 140, in dataflow
    | 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\pipeline.py", line 414, in __exit__
    self.run().wait_until_finish()
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1148, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 201, in read_records
    yield self._coder.decode(record)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/coders/coders.py", line 307, in decode
    return value.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 190: invalid continuation byte

Try to write a CustomCoder class and "ignore" any errors while decoding:
from apache_beam.coders.coders import Coder

class CustomCoder(Coder):
    """A custom coder used for reading and writing strings as UTF-8."""

    def encode(self, value):
        return value.encode("utf-8", "replace")

    def decode(self, value):
        return value.decode("utf-8", "ignore")

    def is_deterministic(self):
        return True
Then, read and write the files using the coder=CustomCoder():
lines = p | "Read" >> ReadFromText("files/path/*.txt", coder=CustomCoder())
# More processing code here...
output | WriteToText("output/file/path", file_name_suffix=".txt", coder=CustomCoder())

This error: "UnicodeDecodeError: 'utf8' codec can't decode byte"
means, that you CSV file still contains some wrong bytes not recognized by the decoder as UTF characters.
The easiest solution for this, is to convert and validate csv input file to not contain UTF8 errors before submitting for Datastore. Simple online UTF8 validation can check it.
If you need to convert latin-1 to UTF-8 in Python, you can do it like this:
string.decode('iso-8859-1').encode('utf8')
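For example, a rough sketch (the file names here are made up) that re-encodes a Latin-1 CSV as UTF-8 before uploading it to the bucket:
import codecs

# Hypothetical input/output paths; adjust to your own files.
# Read the CSV assuming it is Latin-1 / ISO-8859-1 encoded...
with codecs.open('etablissements_latin1.csv', 'r', encoding='iso-8859-1') as src:
    text = src.read()

# ...and write it back out as valid UTF-8 so the Beam pipeline can decode it.
with codecs.open('etablissements_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)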

Related

Trying to run Word Embeddings Benchmarks and getting UnicodeDecodeError

I am trying to run the Word Embeddings Benchmarks from this GitHub repo (Word Embeddings Benchmarks Github) on word2vec embeddings I've created. I've included a picture of what my embedding file looks like.
I keep getting this error:
Traceback (most recent call last):
File "./evaluate_on_all.py", line 75, in <module>
load_kwargs=load_kwargs)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embeddings.py", line 39, in load_embedding
w = Embedding.from_word2vec(fname, binary=False)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embedding.py", line 482, in from_word2vec
words, vectors = Embedding._from_word2vec_text(fname)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embedding.py", line 340, in _from_word2vec_text
header = fin.readline()
File "/share/software/user/open/python/3.6.1/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 16: invalid start byte
I just want to be able to get the benchmarks to work properly with my embeddings.
Results of hexdump of header:
It looks like you're getting the error reading the very-first header line of the file (which suggests it's not something like a challenging word later on):
https://github.com/kudkudak/word-embeddings-benchmarks/blob/2b56c401ea4bba335ebfc0c8c5c4f8ba6394f2cd/web/embedding.py#L340
Are you sure that you're specifying the right plain-text file?
Might the file have extra hidden characters at the beginning, like the 'Byte Order Mark'? (Looking at hexdump -C YOUR_FILE_NAME | head could give a clue.)
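If hexdump is not handy, a rough Python check of the first bytes (the file name is a placeholder) can reveal a BOM or other unexpected leading bytes:
# Inspect the first bytes of the embeddings file; a UTF-8 BOM shows up as
# b'\xef\xbb\xbf', while a word2vec text header should be plain ASCII digits.
with open('my_embeddings.txt', 'rb') as f:  # hypothetical file name
    head = f.read(32)
print(repr(head))
if head.startswith(b'\xef\xbb\xbf'):
    print('File starts with a UTF-8 byte order mark')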

Getting error "'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)" when rebuilding haystack index

I have an application where I have to store people's names and make them searchable. The technologies I am using are Python (v2.7.6), Django (v1.9.5) and Django REST framework. The DBMS is PostgreSQL (v9.2). Since the user names can be Arabic, we are using utf-8 as the DB encoding. For search we are using Haystack (v2.4.1) with Amazon Elasticsearch Service for indexing. The index was building fine a few days ago, but now when I try to rebuild it with
python manage.py rebuild_index
it fails with the following error
'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)
The full error trace is
File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 188, in handle_label
self.update_backend(label, using)
File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 233, in update_backend
do_update(backend, index, qs, start, end, total, verbosity=self.verbosity, commit=self.commit)
File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 96, in do_update
backend.update(index, current_qs, commit=commit)
File "/usr/local/lib/python2.7/dist-packages/haystack/backends/elasticsearch_backend.py", line 193, in update
bulk(self.conn, prepped_docs, index=self.index_name, doc_type='modelresult')
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 188, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 160, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 85, in _process_bulk_chunk
resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 795, in bulk
doc_type, '_bulk'), params=params, body=self._bulk_body(body))
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 329, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 68, in perform_request
response = self.session.request(method, url, data=body, timeout=timeout or self.timeout)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 330, in send
timeout=timeout
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 558, in urlopen
body=body, headers=headers)
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 353, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python2.7/httplib.py", line 979, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/httplib.py", line 1013, in _send_request
self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 833, in _send_output
msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)
My guess is that before, we didn't have Arabic characters in our database, so the index was building fine; but now that users have entered Arabic characters, the index fails to build.
If you are using the requests-aws4auth package, then you can use the following wrapper class in place of the AWS4Auth class. It encodes the headers created by AWS4Auth into byte strings thus avoiding the UnicodeDecodeError downstream.
from requests_aws4auth import AWS4Auth

class AWS4AuthEncodingFix(AWS4Auth):
    def __call__(self, request):
        request = super(AWS4AuthEncodingFix, self).__call__(request)
        for header_name in request.headers:
            self._encode_header_to_utf8(request, header_name)
        return request

    def _encode_header_to_utf8(self, request, header_name):
        value = request.headers[header_name]
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(header_name, unicode):
            del request.headers[header_name]
            header_name = header_name.encode('utf-8')
        request.headers[header_name] = value
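As a rough usage sketch (the host, region and credentials below are placeholders), the wrapper is passed wherever you currently pass AWS4Auth, for example to an elasticsearch-py connection:
from elasticsearch import Elasticsearch, RequestsHttpConnection

# Placeholder credentials, region and endpoint; use your real AWS values.
awsauth = AWS4AuthEncodingFix('ACCESS_KEY', 'SECRET_KEY', 'us-east-1', 'es')

es = Elasticsearch(
    hosts=[{'host': 'search-mydomain.us-east-1.es.amazonaws.com', 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)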
I suspect you're correct about the Arabic chars now showing up in the DB.
https://github.com/elastic/elasticsearch-py/issues/392
https://github.com/django-haystack/django-haystack/issues/1072
are also possibly related to this issue. The first link seems to have some kind of workaround for it, but doesn't have a lot of detail. I suspect that what the author meant with
The proper fix is to use unicode type instead of str or set the default encoding properly to (I assume) utf-8.
is that you need to check that the machine it's running on has LANG=en_US.UTF-8, or at least some UTF-8 LANG.
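A quick way to verify that, assuming you can run Python on the machine in question:
import locale
import os
import sys

print(os.environ.get('LANG'))         # e.g. 'en_US.UTF-8'
print(locale.getpreferredencoding())  # should report 'UTF-8'
print(sys.getdefaultencoding())       # 'ascii' on a stock Python 2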
Elasticsearch supports different encodings, so having Arabic characters shouldn't be the problem.
Since you are using AWS, I will assume you also use some authorization library like requests-aws4auth.
If that is the case, notice that during authorization some unicode headers are added, like u'x-amz-date'. That is a problem, since Python's httplib performs the following during _send_output(): msg = "\r\n".join(self._buffer), where _buffer is a list of the HTTP headers. Having unicode headers makes msg be of <type 'unicode'>, while it really should be of type str (here is a similar issue with a different auth library).
The line that raises the exception, msg += message_body, raises it because Python needs to decode message_body to unicode so that it matches the type of msg. The exception is raised because py-elasticsearch already took care of the encoding, so we end up encoding to unicode twice, which causes the exception (as explained here).
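A tiny Python 2 sketch of that failure mode (the header and body values are made up):
# -*- coding: utf-8 -*-
# A unicode header buffer, as httplib builds it when unicode headers are present...
msg = u"POST /_bulk HTTP/1.1\r\nx-amz-date: 20180101T000000Z\r\n\r\n"
# ...and a body that py-elasticsearch has already encoded to UTF-8 bytes.
body = u'{"name": "عمر"}'.encode('utf-8')
# httplib then effectively does msg += body, which forces an implicit
# body.decode('ascii') and raises UnicodeDecodeError.
msg += body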
You may want to try to replace the auth library (for example with DavidMuller/aws-requests-auth) and see if it fixes the problem.
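If you do switch, a minimal sketch with aws-requests-auth (all values below are placeholders) looks roughly like this:
import requests
from aws_requests_auth.aws_auth import AWSRequestsAuth

# Placeholder credentials and endpoint.
auth = AWSRequestsAuth(aws_access_key='ACCESS_KEY',
                       aws_secret_access_key='SECRET_KEY',
                       aws_host='search-mydomain.us-east-1.es.amazonaws.com',
                       aws_region='us-east-1',
                       aws_service='es')

response = requests.get('https://search-mydomain.us-east-1.es.amazonaws.com', auth=auth)
print(response.status_code)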

pandas reading .csv files

I have a small script that uses pandas to read and print a .csv file generated from MS Excel.
import pandas as pd
data = pd.read_csv('./2010-11.csv')
print(data)
Now, this script runs in Python 2.7.8, but in Python 3.4.1 it gives the following error. Any ideas why this might be so? Thanks in advance for any help with this.
Traceback (most recent call last):
File "proc_csv_0-0.py", line 3, in <module>
data = pd.read_csv('./2010-11.csv')
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 260, in _read
return parser.read()
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 721, in read
ret = self._engine.read(nrows)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 1170, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7566)
File "pandas/parser.pyx", line 791, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7806)
File "pandas/parser.pyx", line 866, in pandas.parser.TextReader._read_rows (pandas/parser.c:8639)
File "pandas/parser.pyx", line 973, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9950)
File "pandas/parser.pyx", line 1033, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10737)
File "pandas/parser.pyx", line 1130, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12141)
File "pandas/parser.pyx", line 1150, in pandas.parser.TextReader._string_convert (pandas/parser.c:12355)
File "pandas/parser.pyx", line 1382, in pandas.parser._string_box_utf8 (pandas/parser.c:17679)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 4: unexpected end of data
In Python 3, when pd.read_csv is passed a file path (as opposed to a file buffer), it decodes the contents with the utf-8 codec by default.[1] It appears your CSV file is using a different encoding. Since it was generated by MS Excel, it might be cp1252:
In [25]: print('\xc9'.decode('cp1252'))
É
In [27]: import unicodedata as UDAT
In [28]: UDAT.name('\xc9'.decode('cp1252'))
Out[28]: 'LATIN CAPITAL LETTER E WITH ACUTE'
The error message
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9
says that '\xc9'.decode('utf-8') raises a UnicodeDecodeError.
The above shows byte 0xc9 can be decoded with cp1252. It remains to be seen if the rest of the file can also be decoded with cp1252, and if it produces the desired result.
Unfortunately, given only a file, there is no surefire way to tell what encoding (if any) was used. It depends entirely on the program used to generate the file.
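One heuristic, assuming the chardet package is available (it is not part of pandas), is to let it guess the encoding from a sample of the raw bytes:
import chardet

# Read a chunk of raw bytes from the file in question and let chardet guess.
with open('./2010-11.csv', 'rb') as f:
    sample = f.read(100000)
print(chardet.detect(sample))  # e.g. {'encoding': 'windows-1252', 'confidence': 0.7}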
If cp1252 is the right encoding, then to load the file into a DataFrame use
data = pd.read_csv('./2010-11.csv', encoding='cp1252')
[1] When pd.read_csv is passed a buffer, the buffer could have been opened with encoding already set:
# Python3
with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)
    print(df)
in which case pd.read_csv will not attempt to decode since the buffer f is already supplying decoded strings.

UnicodeDecodeError: in python 2.7

I am working with the VirusTotal API, specifically with this:
https://www.virustotal.com/es/documentation/public-api/#scanning-files
This is the part of my script where I'm having problems:
def scanAFile(fileToScan):
    host = "www.virustotal.com"
    selector = "https://www.virustotal.com/vtapi/v2/file/scan"
    fields = [("apikey", myPublicKey)]
    file_to_send = open(fileToScan, "rb").read()
    files = [("file", fileToScan, file_to_send)]
    json = postfile.post_multipart(host, selector, fields, files)
    return simplejson.loads(json)
With some files there is no error and it runs fine, but the following error occurs when trying to scan certain files; for example, this error is for a jpg file:
Traceback (most recent call last):
File "F:/devPy/myProjects/script_vt.py", line 138, in <module>
scanMyFile()
File "F:/devPy/myProjects/script_vt.py", line 75, in scanQueue
jsonScan = scanAFile(fileToScan)
File "F:/devPy/myProjects/script_vt.py", line 37, in scanAFile
json = postfile.post_multipart(host, selector, fields, files)
File "F:\devPy\myProjects\script_vt.py", line 10, in post_multipart
content_type, body = encode_multipart_formdata(fields, files)
File "F:\devPy\myProjects\script_vt.py", line 42, in encode_multipart_formdata
body = CRLF.join(L)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
I should mention that I work with PyCharm under Windows; could this cause the encoding error?
Any idea how to solve it? I got stuck and couldn't find any solution on the net.

Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi

I'm stuck here trying to unescape HTML special characters.
The problematic text is
Rudimental &amp; Emeli Sandé
which should be converted to
Rudimental & Emeli Sandé
The text is downloaded via wget (outside of Python).
To test this, save an ANSI file with this line and import it.
import HTMLParser
trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()
track = html_parser.unescape(track)
print(track)
I get this error when a line has é in it.
pi@raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
File "unparse.py", line 9, in <module>
track = html_parser.unescape(track)
File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
The same code works fine under Windows; I only have problems on the Raspberry Pi running Python 2.7.3.
Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.
Your problem (condensed):
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)
produces
Traceback (most recent call last):
File "problem.py", line 4, in <module>
output = parser.unescape(input)
File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
HTMLParser.unescape() returns a unicode object, and therefore has to convert your input str. So it asks for the default encoding (which in your case is ASCII) and fails to interpret '\xe9' as an ASCII character (because it isn't). I guess your file encoding is ISO-8859-1 where '\xe9' is 'é'.
There are two easy solutions. Either you do the conversion manually:
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
input = input.decode('iso-8859-1')
output = parser.unescape(input)
or you use codecs.open() instead of open() whenever you are working with files:
import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)