What can be wrong with this import?
I downloaded ftfy version 4.4 for Jython 2.7:
import ftfy
import sys
print (ftfy.fix_encoding("н368вв777"))
Traceback (most recent call last):
File "D:/rs_al/IdeaProjects/XLStoSQL/src/main/java/BrokenUTF8.py", line 4,
in <module>
import ftfy
File "C:\jython\Lib\site-packages\ftfy\__init__.py", line 12, in <module>
from ftfy import fixes
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 2-8:
illegal Unicode character
With Python 3 and ftfy 5 everything works, but I wanted to use Java with Jython to repair broken UTF-8 text with the ftfy package and return the result to Java.
I also set the default source encoding to UTF-8, because Jython 2.7 decodes source files as ASCII by default.
ftfy works at full power only with Python 3. I moved the project to Python 3, which solved the problem.
Related
Help me figure out what's wrong with this. I am running text summarization using Transformers:
~/Bart_T5-summarization$ python app.py
No handlers could be found for logger "transformers.data.metrics"
Traceback (most recent call last):
File "app.py", line 6, in <module>
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/__init__.py", line 42, in <module>
from .tokenization_auto import AutoTokenizer
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_auto.py", line 28, in <module>
from .tokenization_xlm import XLMTokenizer
File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_xlm.py", line 27, in <module>
import sacremoses as sm
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/__init__.py", line 2, in <module>
from sacremoses.tokenize import *
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 16, in <module>
class MosesTokenizer(object):
File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 41, in MosesTokenizer
PAD_NOT_ISALNUM = r"([^{}\s.'`\,-])".format(IsAlnum), r" \1 "
UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)
Running the command with python3 instead of python solved this issue for me. I was able to run the code and obtain a summarization.
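The traceback paths show the script running under python2.7, while the fix was to run it with python3. A minimal sketch of a guard that fails early with a readable message, instead of a UnicodeEncodeError deep inside a dependency, could look like this. The function name require_python3 is hypothetical and is not part of transformers or sacremoses.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Raise a clear error when the interpreter is older than Python 3."""
    if version_info[0] < 3:
        raise RuntimeError(
            "This script requires Python 3, found %d.%d"
            % (version_info[0], version_info[1])
        )
    return True


# Call this at the very top of the script, before importing heavy dependencies.
require_python3()
```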
I have a Python script that reads and writes a file containing German umlauts (äöü); the input file is "myfile.in". I am using Python 2.7. Here is a reduced version of my script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
if __name__=='__main__':
    with open("myfile.in", "r") as f:
        lines = f.readlines()
    txt = ""
    for line in lines:
        txt = txt + line
    with open("myfile.out", "w") as f:
        f.write(txt)
This works fine.
Now my customer requires me to use future statement definitions, so I added the following line to my Python script:
from __future__ import unicode_literals
Now I get the following error message:
Traceback (most recent call last):
File "myscript.py", line 9, in <module>
txt = txt + line
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 23: ordinal not in range(128)
How can I resolve this problem?
Thanks for your hints, Thomas
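One possible approach, sketched under assumptions: with unicode_literals, txt = "" becomes a unicode object, so txt + line forces Python 2 to decode the byte string in line with the ascii codec, and that fails on the umlaut. Opening the files with io.open and an explicit encoding makes every line unicode from the start. Byte 0xe4 is "ä" in Latin-1, so the sketch assumes a Latin-1 encoded file; the helper name copy_text and that default encoding are assumptions, so adjust the encoding to match the real file.

```python
import io


def copy_text(src, dst, encoding="latin-1"):
    """Read src as text and write it to dst, using an explicit encoding."""
    # io.open decodes while reading, so every line is unicode, not bytes.
    with io.open(src, "r", encoding=encoding) as f:
        lines = f.readlines()
    txt = u""
    for line in lines:
        txt = txt + line  # unicode + unicode: no implicit ascii decode
    # io.open encodes while writing, using the same explicit encoding.
    with io.open(dst, "w", encoding=encoding) as f:
        f.write(txt)
    return txt
```

The function behaves the same with or without from __future__ import unicode_literals at the top of the script, because no byte strings are mixed with unicode anywhere.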
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
"\xe2\x80\x9c" is the UTF-8 encoding of the left curly quote (U+201C). When I try to replace curly quotes in a web page using this code, I get this error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2265: ordinal not in range(128)
What does this error mean, what am I doing wrong, and how do I fix it?
You have to decode the downloaded page with decode('utf-8') before replacing, so that both the page and the search string are unicode:
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read().decode('utf-8')
web = web.replace(b"\xe2\x80\x9c".decode('utf8'), '"')
print(web)
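This happens because Python 2 falls back to the "ascii" codec whenever it has to convert a byte string to unicode implicitly, which is what it does when a str and a unicode object are mixed. In Python 3, string literals are unicode and source files default to UTF-8, so you can put the character directly in your code. You can get unicode literals in Python 2 now with a future import; the source file then also needs an encoding declaration such as # -*- coding: utf-8 -*- because it contains a non-ASCII character.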
from __future__ import unicode_literals
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.decode("utf-8")
web = web.replace('“', '"')
print(repr(web))
Note that this is a Python 2 solution. Python 3 handles strings and bytes differently.
I can reproduce the problem with
>>> web = "0123\xe2\x80\x9c789"
>>> web.replace("\xe2\x80\x9c".decode('utf-8'), '"')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
You read an encoded string into web; I just made a simpler one for testing. When you decoded the search string, you created a unicode object. For the replacement to work, web has to be converted to unicode as well, and Python 2 attempts that conversion implicitly with the ascii codec:
>>> "\xe2\x80\x9c".decode('utf-8')
u'\u201c'
>>> unicode(web)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
It was the implicit conversion of web that got you. In Python 2, str can hold encoded bytes, and that's exactly what you have here. One option is to just replace the encoded byte sequence:
>>> web.replace("\xe2\x80\x9c", '"')
'0123"789'
This only works because you knew the page was encoded as UTF-8. That is usually the case, but it is worth mentioning.
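The answers above all come down to keeping both operands the same type. A minimal sketch that decodes first and then replaces is shown here; the helper name straighten_quotes is hypothetical, the page is assumed to be UTF-8, and the sample bytes stand in for the downloaded page.

```python
def straighten_quotes(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes, then replace curly double quotes."""
    text = raw_bytes.decode(encoding)  # bytes -> unicode, done explicitly
    # U+201C and U+201D are the left and right curly double quotes.
    return text.replace(u"\u201c", u'"').replace(u"\u201d", u'"')


sample = b"0123\xe2\x80\x9c789"  # same test string as in the answer above
result = straighten_quotes(sample)
```

Decoding once at the boundary, right after read(), means the rest of the program only ever handles unicode and no implicit ascii conversion can occur.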
This is a really simple Python script that runs normally on Linux, but when I moved it to Windows I got a strange error. I would appreciate some help.
Before running the code, I prepared the environment:
1. Install Microsoft Visual C++ Compiler for Python 2.7
2. Install Python 2.7.11
3. pip install pyinstaller
4. easy_install pyshark
Below is part of my code.
# -*- coding: utf-8 -*-
from __future__ import print_function
import pyshark
import lxml
import os
def analysis_method(file_name):
    cap = pyshark.FileCapture(input_file=file_name)
    for packet in cap:
        if hasattr(packet, "http"):
            http_layer = packet["http"]
Below is the error information:
Traceback (most recent call last):
File "packet_offline_analysis.py", line 36, in analysis_method
for packet in cap:
File "C:\Python27\lib\site-packages\pyshark-0.3.6.1-py2.7.egg\pyshark\capture\capture.py", line 173, in _packets_from_tshark_sync
self._get_packet_from_stream(tshark_process.stdout, data, psml_structure=psml_structure))
File "C:\Python27\lib\site-packages\trollius-1.0.4-py2.7-win32.egg\trollius\base_events.py", line 300, in run_until_complete
return future.result()
File "C:\Python27\lib\site-packages\trollius-1.0.4-py2.7-win32.egg\trollius\futures.py", line 287, in result
raise self._exception
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xD6 0xD0 0xB9 0xFA, line 6, column 58
Uninstall the Chinese version of Wireshark and install an English version instead.
That solves the problem.
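A possible explanation, offered as an assumption rather than something this thread confirms: the four bytes in the error message are not valid UTF-8 but do decode under GBK, an encoding common on Chinese Windows systems. That would be consistent with a localized Wireshark writing non-UTF-8 text into the XML that lxml then rejects. The check below uses only the bytes quoted in the traceback; the helper name decodes_as is hypothetical.

```python
# Bytes quoted in the lxml error message.
bad = b"\xd6\xd0\xb9\xfa"


def decodes_as(data, encoding):
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# 0xD6 starts a two-byte UTF-8 sequence, but 0xD0 is not a continuation byte.
is_utf8 = decodes_as(bad, "utf-8")
# Under GBK the same bytes form two Chinese characters (U+4E2D, U+56FD).
as_gbk = bad.decode("gbk")
```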
Code:
import urllib2
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.prettify())
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 7, in <module>
print(soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 8775: ordinal not in range(128)
[Finished in 2.4s with exit code 1]
I can't figure out the error. I am using Python 2.7.9.
If your console uses ASCII, print has to convert the unicode text to ASCII, and any character outside the ASCII range raises this exception.
If the console accepts Unicode, everything is displayed correctly. Try this command and run the program again:
export LANG=en_US.UTF-8
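Changing the locale is one route. Another, sketched here under assumptions, is to encode explicitly before writing, so that characters the console cannot show are replaced instead of raising. The helper name to_console_bytes is hypothetical.

```python
import sys


def to_console_bytes(text, encoding=None):
    """Encode text for the console, replacing unencodable characters."""
    # Fall back to ASCII when the stream does not report an encoding.
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(enc, "replace")


# u"\xe8" is the character from the traceback above.
ascii_bytes = to_console_bytes(u"caf\xe8", "ascii")
utf8_bytes = to_console_bytes(u"caf\xe8", "utf-8")
```

On Windows, where export is not available, setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is another way to control the encoding that print uses.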