UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position - python-2.7

When I try to extract a pattern from a tagged text with nltk, I get the error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128). I did not have this error at first; it appeared only after I installed some packages.
This is the code:
# -*- coding: utf-8 -*-
import codecs
import re
import sys
import nltk
from nltk.corpus import *

k = nltk.corpus.brown.tagged_words('myfile')
for (w1, t1), (w2, t2) in nltk.bigrams(k):
    if t1 == 'NN' and t2 == 'AJ':
        print w1, w2
This is the entire output:
Traceback (most recent call last):
File "/home/fathi/egfe.py", line 12, in <module>
for (w1,t1), (w2,t2) in nltk.bigrams(k):
File "/usr/local/lib/python2.7/dist-packages/nltk/util.py", line 442, in bigrams
for item in ngrams(sequence, 2, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/nltk/util.py", line 419, in ngrams
history.append(next(sequence))
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
tokens = self.read_block(self._stream)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/tagged.py", line 241, in read_block
for para_str in self._para_block_reader(stream):
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 564, in read_blankline_block
line = stream.readline()
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1095, in readline
new_chars = self._read(readsize)
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1322, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1352, in _incr_decode
return self.decode(bytes, 'strict')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128)

The problem is that the nltk version is not compatible with the Python version, so an older version of the NLTK toolkit is required.
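Independent of the version question, the traceback shows NLTK decoding the corpus file as ASCII. A minimal sketch of an alternative, assuming 'myfile' is UTF-8-encoded (an assumption; the question does not say): load it through a TaggedCorpusReader with an explicit encoding.
# Sketch, not the asker's exact setup: read 'myfile' (assumed UTF-8)
# with an explicit encoding instead of the reader's default.
from nltk.corpus.reader import TaggedCorpusReader

reader = TaggedCorpusReader('.', ['myfile'], encoding='utf-8')
k = reader.tagged_words('myfile')  # (word, tag) pairs as unicode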

Related

UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)

Help me figure out what's wrong here. I am running text summarization using Transformers:
~/Bart_T5-summarization$ python app.py
No handlers could be found for logger "transformers.data.metrics"
Traceback (most recent call last):
File "app.py", line 6, in
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/init.py", line 42, in
from .tokenization_auto import AutoTokenizer
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_auto.py", line 28, in
from .tokenization_xlm import XLMTokenizer
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_xlm.py", line 27, in
import sacremoses as sm
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/init.py", line 2, in
from sacremoses.tokenize import *
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 16, in
class MosesTokenizer(object):
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 41, in MosesTokenizer
PAD_NOT_ISALNUM = r"([^{}\s.'`\,-])".format(IsAlnum), r" \1 "
UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)
Running the command with python3 instead of python solved this issue for me. I was able to run the code and obtain a summarization.
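For context, a minimal sketch (not from the question) of why the interpreter version matters here: in Python 2, str is a byte string, and mixing it with unicode forces an implicit ASCII decode, which is exactly the step that fails; in Python 3, str is Unicode throughout, so no implicit decode happens.
# Python 2: concatenating a byte string with unicode triggers an
# implicit ASCII decode of the bytes and raises UnicodeDecodeError.
>>> 'Sand\xc3\xa9' + u'!'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
# Python 3: both operands are str (Unicode), so the same expression
# is plain concatenation and succeeds.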

Python 2.7 import unicode_literals from __future__ gives UnicodeDecodeError while reading a file with umlauts

I have a Python script which reads and writes a file containing German umlauts (äöü) from an input file "myfile.in". I am using Python 2.7. Here is a reduced version of my script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-

if __name__ == '__main__':
    with open("myfile.in", "r") as f:
        lines = f.readlines()
    txt = ""
    for line in lines:
        txt = txt + line
    with open("myfile.out", "w") as f:
        f.write(txt)
This works fine.
Now I have a requirement from my customer to use future statement definitions, so I added the following line to my Python script:
from __future__ import unicode_literals
Now I get the following error message:
Traceback (most recent call last):
File "myscript.py", line 9, in <module>
txt = txt + line
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 23: ordinal not in range(128)
How can I resolve this problem?
Thanks for your hints Thomas
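A minimal sketch of one way out, assuming "myfile.in" is UTF-8-encoded (the coding declaration suggests so, but the question does not confirm it): open the files with io.open and an explicit encoding, so every line is already unicode and the concatenation with the now-unicode txt literal never triggers an implicit ASCII decode.
# Sketch, assuming UTF-8 input: io.open returns unicode lines, so
# unicode_literals no longer forces an implicit ASCII decode.
import io

with io.open("myfile.in", "r", encoding="utf-8") as f:
    lines = f.readlines()
txt = ""
for line in lines:
    txt = txt + line
with io.open("myfile.out", "w", encoding="utf-8") as f:
    f.write(txt)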

UnicodeDecodeError in Python Classification Arabic Datasets

I have Arabic datasets for classification using Python; two directories (negative and positive) in a Twitter directory.
I want to use Python classes to classify the data. When I run the attached code, this error occurs:
File "C:\Users\DEV2016\Anaconda2\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc7 in position 0: invalid continuation byte
import sklearn.datasets
import sklearn.feature_extraction.text  # needed for CountVectorizer/TfidfTransformer below
import sklearn.metrics
import sklearn.cross_validation
import sklearn.svm
import sklearn.naive_bayes
import sklearn.neighbors

dir_path = "E:\Twitter\Twitter"
# Loading files into memory
files = sklearn.datasets.load_files(dir_path)
# Calculating BOW
count_vector = sklearn.feature_extraction.text.CountVectorizer()
word_counts = count_vector.fit_transform(files.data)
# Calculating TFIDF
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(word_counts)
X = tf_transformer.transform(word_counts)
# Create classifier
# clf = sklearn.naive_bayes.MultinomialNB()
# clf = sklearn.svm.LinearSVC()
n_neighbors = 11
weights = 'distance'
clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
# Train-test split
test_size = 0.4
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, files.target, test_size=test_size)
# Test the classifier
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print(sklearn.metrics.classification_report(y_test, y_predicted,
                                            target_names=files.target_names))
print('Confusion Matrix:')
print(sklearn.metrics.confusion_matrix(y_test, y_predicted))
Traceback:
File "<ipython-input-19-8ea269fd9c3d>", line 1, in <module>
runfile('C:/Users/DEV2016/.spyder/clf.py', wdir='C:/Users/DEV2016/.spyder')
File "C:\Users\DEV2016\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\DEV2016\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/DEV2016/.spyder/clf.py", line 18, in <module>
word_counts=count_vector.fit_transform(files.data)
File "C:\Users\DEV2016\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\DEV2016\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
for feature in analyze(doc):
File "C:\Users\DEV2016\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\DEV2016\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 116, in decode
doc = doc.decode(self.encoding, self.decode_error)
File "C:\Users\DEV2016\Anaconda2\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc7 in position 0: invalid continuation byte
The Twitter data you are trying to load contains bytes that are not valid UTF-8. Try loading it with another encoding, for example:
files = sklearn.datasets.load_files(dir_path, encoding="iso-8859-1")
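Two hedged alternatives, in case iso-8859-1 mangles the Arabic text (the actual encoding is not stated in the question): windows-1256 is the common legacy code page for Arabic, and the decode_error parameter lets you keep UTF-8 while substituting undecodable bytes.
# Assumption: the corpus may be saved in the Arabic Windows code page.
files = sklearn.datasets.load_files(dir_path, encoding="windows-1256")
# Or keep UTF-8 and replace bad bytes instead of raising an error.
files = sklearn.datasets.load_files(dir_path, encoding="utf-8", decode_error="replace")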

Draw graph using networkx

Here's my code:
import networkx as nx
import matplotlib.pyplot as plt
fh = open('one.txt', 'r')
G = nx.read_edgelist(fh, nodetype=int)
fh.close()
print nx.info(G)
nx.draw(G)
plt.show()
But when I run it I get the following error:
Traceback (most recent call last):
File "graphDeBruijn.py", line 5, in <module>
G=nx.read_edgelist(fh, nodetype=int)
File "<decorator-gen-286>", line 2, in read_edgelist
File "C:\PYTHON27\lib\site-packages\networkx-2.0.dev20161201181419-py2.7.egg\networkx\utils\decorators.py", line 221, in _open_file
result = func(*new_args, **kwargs)
File "C:\PYTHON27\lib\site-packages\networkx-2.0.dev20161201181419-py2.7.egg\networkx\readwrite\edgelist.py", line 374, in read_edgelist
data=data)
File "C:\PYTHON27\lib\site-packages\networkx-2.0.dev20161201181419-py2.7.egg\networkx\readwrite\edgelist.py", line 255, in parse_edgelist
for line in lines:
File "C:\PYTHON27\lib\site-packages\networkx-2.0.dev20161201181419-py2.7.egg\networkx\readwrite\edgelist.py", line 371, in <genexpr>
lines = (line.decode(encoding) for line in path)
File "C:\PYTHON27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Can anyone help me? Thanks!
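A hedged observation: byte 0xff at position 0 usually indicates a UTF-16 byte-order mark, so one.txt may have been saved as UTF-16 (an assumption; the question does not show the file). A minimal sketch that converts the file to UTF-8 once (the name one_utf8.txt is made up for illustration) and then reads it as before:
# Sketch, assuming one.txt is UTF-16 (0xff at position 0 looks like a BOM):
# re-encode it to UTF-8, then read the converted copy.
import io
with io.open('one.txt', 'r', encoding='utf-16') as src, \
     io.open('one_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())
G = nx.read_edgelist('one_utf8.txt', nodetype=int)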

Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi

I'm stuck here trying to unescape HTML special characters.
The problematic text is
Rudimental &amp; Emeli Sandé
which should be converted to
Rudimental & Emeli Sandé
The text is downloaded via wget (outside of Python). To test this, save an ANSI file with this line and import it.
import HTMLParser
trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()
track = html_parser.unescape(track)
print(track)
I get this error when a line has é in it.
pi@raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
File "unparse.py", line 9, in <module>
track = html_parser.unescape(track)
File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
The same code works fine under Windows; I only have problems on the Raspberry Pi running Python 2.7.3.
Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.
Your problem (condensed):
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)
produces
Traceback (most recent call last):
File "problem.py", line 4, in <module>
output = parser.unescape(input)
File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
HTMLParser.unescape() returns a unicode object, and therefore has to convert your input str. So it asks for the default encoding (which in your case is ASCII) and fails to interpret '\xe9' as an ASCII character (because it isn't). I guess your file encoding is ISO-8859-1 where '\xe9' is 'é'.
There are two easy solutions. Either you do the conversion manually:
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
input = input.decode('iso-8859-1')
output = parser.unescape(input)
or you use codecs.open() instead of open() whenever you are working with files:
import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)
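Either way, unescape() receives a unicode object, so the implicit ASCII decode inside re.sub() never happens; codecs.open() simply performs that decode up front, at read time, with an encoding you choose.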