Changing parameters in scikit-learn classifier results in UnicodeDecodeError - python-2.7

I have a CSV file with hundreds of thousands of lines, including a column of free text that contains Scandinavian characters among others, and I'm fitting scikit-learn classifiers to predict a True/False label (given in another column) for a given piece of text.
I am using this example as a starting point: http://scikit-learn.org/0.15/auto_examples/grid_search_text_feature_extraction.html
At first the only thing I changed was the data, and training and testing work fine with useful results.
However, I want to test the liblinear-like LinearSVC classifier, as that might produce better results in some cases. Changing nothing but the classifier to LinearSVC, or alternatively keeping SGDClassifier as in the example but changing the loss function from the default hinge to squared_hinge, results in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 312: ordinal not in range(128)
I am assuming this error must arise from the input CSV, but I cannot understand why the initial example from the URL runs smoothly while changing the classifier properties with exactly the same input data results in this error. Any ideas why this may be?
Secondly, I am not familiar with Python stack traces and would appreciate any help on how to debug the error / trace it down to the problematic byte. The stack trace is the following:
Traceback (most recent call last):
File "<ipython-input-33-2a0e420237f0>", line 48, in <module>
grid_search.fit(data_train.kuvaus, data_train.loukkaantuneita)
File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\model_selection\_search.py", line 945, in fit
return self._fit(X, y, groups, ParameterGrid(self.param_grid))
File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\model_selection\_search.py", line 564, in _fit
for parameters in parameter_iterable
File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 768, in __call__
self.retrieve()
File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 696, in retrieve
stack_start=1)
File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\externals\joblib\format_stack.py", line 417, in format_outer_frames
return '\n'.join(format_records(output[stack_end:stack_start:-1]))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 312: ordinal not in range(128)
data_train is a pandas DataFrame with a True/False column data_train.loukkaantuneita and a free-text column data_train.kuvaus (supposed to be UTF-8).

Put these statements at the beginning of the code and try running it:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
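A less global alternative (a sketch, assuming the kuvaus column still holds raw UTF-8 byte strings; the file name below is a placeholder) is to decode the text column to unicode before fitting, so neither the vectorizer nor joblib's error formatting ever has to handle undecoded bytes:
import pandas as pd

# Placeholder path; use whatever you already pass to read_csv.
data_train = pd.read_csv("data.csv")

# If the column holds UTF-8 byte strings, decode it explicitly up front
# instead of relying on sys.setdefaultencoding.
data_train["kuvaus"] = data_train["kuvaus"].str.decode("utf-8")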

Related

Trying to run Word Embeddings Benchmarks and getting UnicodeDecodeError

I am trying to run the Word Embeddings Benchmarks from this GitHub repository (Word Embeddings Benchmarks Github) on word2vec embeddings I've created. I've included a picture of what my embedding file looks like.
I keep getting this error:
Traceback (most recent call last):
File "./evaluate_on_all.py", line 75, in <module>
load_kwargs=load_kwargs)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embeddings.py", line 39, in load_embedding
w = Embedding.from_word2vec(fname, binary=False)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embedding.py", line 482, in from_word2vec
words, vectors = Embedding._from_word2vec_text(fname)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embedding.py", line 340, in _from_word2vec_text
header = fin.readline()
File "/share/software/user/open/python/3.6.1/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 16: invalid start byte
I just want to be able to get the benchmarks to work properly with my embeddings.
Results of hexdump of header:
It looks like you're getting the error while reading the very first header line of the file (which suggests it's not something like a challenging word later on):
https://github.com/kudkudak/word-embeddings-benchmarks/blob/2b56c401ea4bba335ebfc0c8c5c4f8ba6394f2cd/web/embedding.py#L340
Are you sure that you're specifying the right plain-text file?
Might the file have extra hidden characters at the beginning, like the 'Byte Order Mark'? (Looking at hexdump -C YOUR_FILE_NAME | head could give a clue.)
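A quick way to check from Python (a sketch; "embeddings.txt" is a placeholder for your file) is to read the raw first line and look for a UTF-8 byte order mark or other unexpected leading bytes:
import codecs

with open("embeddings.txt", "rb") as f:
    header = f.readline()

print(repr(header[:32]))                   # raw leading bytes
print(header.startswith(codecs.BOM_UTF8))  # True if a BOM is present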

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 3: ordinal not in range(128)

I encounter an error running a script I wrote on an Ubuntu box with Python 2.7.
It seems there is some issue with unicode, yet I cannot figure it out.
I tried to encode the variables with UTF-8, yet I'm not even sure which one causes the error, str(count) or tag[u'Value'].
Traceback (most recent call last):
File "./AWS_collection_unix.py", line 105, in <module>
main()
File "./AWS_collection_unix.py", line 91, in main
ec2_tags_per_region(region, text_file)
File "./AWS_collection_unix.py", line 65, in ec2_tags_per_region
print_ids_and_tags(instance, text_file)
File "./AWS_collection_unix.py", line 16, in print_ids_and_tags
text_file.write('%s. %s' % (str(count), tag[u'Value']))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 3: ordinal not in range(128)
The error doesn't specify which parameter it comes from.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 3: ordinal not in range(128)
How can I make this work appropriately?
Thanks
The shortest solution is to wrap count with unicode(count).encode('utf-8'),
which will convert count to a UTF-8 encoded str.
But the best way is to understand what the encoding of the count variable is.
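If the problem is in the formatting and writing step (a sketch, assuming text_file is a plain file object opened elsewhere in the script), you can keep everything as unicode while formatting and encode once when writing:
line = u'%s. %s' % (count, tag[u'Value'])
text_file.write(line.encode('utf-8'))

Alternatively, open the output file with an explicit encoding (io.open is available in Python 2.7) and write unicode directly:
import io

text_file = io.open('output.txt', 'w', encoding='utf-8')  # placeholder path
text_file.write(u'%s. %s' % (count, tag[u'Value']))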

apache beam 2.7.0 crashes in utf-8 decoding French characters

I am trying to write a CSV file from a Google Cloud Platform bucket into Datastore. It contains French characters/accents, but I get an error message regarding decoding.
After trying to encode and decode from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried to change things manually...
The OS I am using has "ascii" as the default encoding, so I manually changed it to UTF-8 in "Anaconda3/envs/py27/lib/site.py":
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    sys.setdefaultencoding("utf-8")
I've tried locally with a test file, by printing and then writing a string with accents into a file, and it worked!
string = 'naïve café'
test_decode = codecs.utf_8_decode(string, "strict", True)[0]
print(test_decode)
with open('./test.txt', 'w') as outfile:
    outfile.write(test_decode)
But no luck with apache_beam...
Then I tried to manually change "/usr/lib/python2.7/encodings/utf_8.py" and put "ignore" instead of "strict" into codecs.utf_8_decode:
def decode(input, errors='ignore'):
    return codecs.utf_8_decode(input, errors, True)
but I've realized that apache_beam does not use this file, or at least does not take any changes into account.
Any ideas how to deal with it?
Please find the error message below:
Traceback (most recent call last):
File "etablissementsFiness.py", line 146, in <module>
dataflow(run_locally)
File "etablissementsFiness.py", line 140, in dataflow
| 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\pipel
ine.py", line 414, in __exit__
self.run().wait_until_finish()
File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\runne
rs\dataflow\dataflow_runner.py", line 1148, in wait_until_finish
(self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow
pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py",
line 642, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", lin
e 156, in execute
op.start()
File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.nativ
e_operations.NativeReadOperation.start
def start(self):
File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.nativ
e_operations.NativeReadOperation.start
with self.scoped_start_state:
File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.nativ
e_operations.NativeReadOperation.start
with self.spec.source.reader() as reader:
File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.nativ
e_operations.NativeReadOperation.start
for value in reader:
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 2
01, in read_records
yield self._coder.decode(record)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/coders/coders.py", li
ne 307, in decode
return value.decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 190: invalid continuation byte
Try to write a CustomCoder class and "ignore" any errors while decoding:
from apache_beam.coders.coders import Coder

class CustomCoder(Coder):
    """A custom coder used for reading and writing strings as UTF-8."""

    def encode(self, value):
        return value.encode("utf-8", "replace")

    def decode(self, value):
        return value.decode("utf-8", "ignore")

    def is_deterministic(self):
        return True
Then, read and write the files using the coder=CustomCoder():
lines = p | "Read" >> ReadFromText("files/path/*.txt", coder=CustomCoder())
# More processing code here...
output | WriteToText("output/file/path", file_name_suffix=".txt", coder=CustomCoder())
This error, "UnicodeDecodeError: 'utf8' codec can't decode byte",
means that your CSV file still contains some bytes that the decoder does not recognize as UTF-8 characters.
The easiest solution is to convert and validate the CSV input file so that it contains no UTF-8 errors before submitting it to Datastore. A simple online UTF-8 validator can check it.
If you need to convert latin-1 to UTF8 in python, you can do it like that:
string.decode('iso-8859-1').encode('utf8')
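To convert a whole file before uploading it (a sketch; the paths are placeholders), read it as Latin-1, which accepts any byte sequence, and re-write it as UTF-8:
import io

with io.open("input_latin1.csv", "r", encoding="iso-8859-1") as src, \
     io.open("output_utf8.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)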

UnicodeDecodeError: 'utf8' in Python 2.7

I have a large file with many lines; most of the lines are UTF-8, but it looks like a few of them are not. When I try to read lines with code like this:
in_file = codecs.open(source, "r", "utf-8")
for line in in_file:
    SOME OPERATIONS
I get the following error:
for line in in_file:
File "C:\Python27\lib\codecs.py", line 681, in next
return self.reader.next()
File "C:\Python27\lib\codecs.py", line 612, in next
line = self.readline()
File "C:\Python27\lib\codecs.py", line 527, in readline
data = self.read(readsize, firstline=True)
File "C:\Python27\lib\codecs.py", line 474, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte
What I would like to do is skip the lines that are not UTF-8 without breaking the code, then go to the next line in the file and do my operations. How can I do it with try and except?
Open the file without any codec. Then, read the file line-by-line and try to decode each line from UTF-8. If that raises an exception, skip the line.
A completely different approach would be to tell the codec to replace or ignore faulty characters. This doesn't skip the lines but you don't seem to care too much about the contained data anyway, so it might be an alternative.
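A minimal sketch of the first approach (reusing the source variable from the question): open the file as raw bytes and decode each line yourself, skipping the ones that fail.
with open(source, "rb") as in_file:
    for raw_line in in_file:
        try:
            line = raw_line.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not valid UTF-8, skip this line
        # SOME OPERATIONS on line

For the second approach, let the codec replace or drop the faulty bytes instead of skipping whole lines:
import codecs

in_file = codecs.open(source, "r", "utf-8", errors="replace")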

encoding errors when replacing unwanted characters in utf-8 encoded file

Code can be downloaded here:
https://github.com/kelrien/pyretrieval/
Whenever I execute my example.py, the following error pops up:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "example.py", line 21, in <module>
docs.append(proc.process(line.decode("utf-8")))
File "pyretrieval\processor.py", line 61, in process
tokens = self.tokenize(string)
File "pyretrieval\processor.py", line 47, in tokenize
temp = temp.replace(char, self.replace_characters[char])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
As you can see, the error happens when trying to replace the German umlauts I specified. If I don't use the replace_characters dict and just ignore those umlauts, I don't get the error.
I already tried a lot of stuff:
Using the codecs module
Using encode("utf-8") and decode("utf-8") at different
I found a solution. I had to specify the characters I wanted to replace as unicode too (in processor.py).
I already pushed the necessary changes to github. https://github.com/kelrien/pyretrieval
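A sketch of what the fix amounts to (the exact dict and method in processor.py may differ): both the keys and the replacement values are unicode literals, so unicode.replace() never has to coerce byte strings through the ASCII codec.
# Hypothetical names; the real dict lives in pyretrieval/processor.py.
replace_characters = {
    u'ä': u'ae',
    u'ö': u'oe',
    u'ü': u'ue',
    u'ß': u'ss',
}

def replace_umlauts(temp):
    for char in replace_characters:
        temp = temp.replace(char, replace_characters[char])
    return temp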