Jython 2.7.1 + ftfy 4.4 - python-2.7

What could be wrong with this import?
I downloaded ftfy 4.4 for Jython 2.7.
import ftfy
import sys
print (ftfy.fix_encoding("н368вв777"))
Traceback (most recent call last):
  File "D:/rs_al/IdeaProjects/XLStoSQL/src/main/java/BrokenUTF8.py", line 4, in <module>
    import ftfy
  File "C:\jython\Lib\site-packages\ftfy\__init__.py", line 12, in <module>
    from ftfy import fixes
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 2-8: illegal Unicode character
With Python 3 and ftfy 5 everything works, but I wanted to use Java with Jython to repair mis-encoded UTF-8 text with the ftfy package and return the data to Java.
I also set the default source encoding to UTF-8, because Jython 2.7 decodes source files as ASCII by default.

ftfy is fully functional only on Python 3. I moved the project to Python, and that solved the problem.
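For context on what repairing "wrong UTF-8 characters" means here: a common kind of damage is UTF-8 bytes that were decoded with a single-byte codec. The sketch below reproduces and reverses that one case with the standard library only, assuming the wrong codec was Latin-1. It is not how ftfy is implemented; ftfy detects such damage heuristically and covers many more cases. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Illustration only: one specific kind of mojibake, repaired by hand.
original = u"н368вв777"

# Simulate the damage: UTF-8 bytes wrongly decoded as Latin-1.
mojibake = original.encode("utf-8").decode("latin-1")

# Reverse it: recover the original bytes, then decode them as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(repaired == original)
```

The round trip works because Latin-1 maps every byte value to exactly one code point, so no information is lost in the wrong decoding step.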

Related

UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)

Help me figure out what's wrong with this. I am running text summarization using Transformers.
~/Bart_T5-summarization$ python app.py
No handlers could be found for logger "transformers.data.metrics"
Traceback (most recent call last):
  File "app.py", line 6, in <module>
    from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
  File "/home/darshan/.local/lib/python2.7/site-packages/transformers/__init__.py", line 42, in <module>
    from .tokenization_auto import AutoTokenizer
  File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_auto.py", line 28, in <module>
    from .tokenization_xlm import XLMTokenizer
  File "/home/darshan/.local/lib/python2.7/site-packages/transformers/tokenization_xlm.py", line 27, in <module>
    import sacremoses as sm
  File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/__init__.py", line 2, in <module>
    from sacremoses.tokenize import *
  File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 16, in <module>
    class MosesTokenizer(object):
  File "/home/darshan/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 41, in MosesTokenizer
    PAD_NOT_ISALNUM = r"([^{}\s.'`\,-])".format(IsAlnum), r" \1 "
UnicodeEncodeError: 'ascii' codec can't encode characters in position 62-11168: ordinal not in range(128)
Running the command with python3 instead of python solved this issue for me. I was able to run the code and obtain a summarization.
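Since the fix was simply to use the right interpreter, a script can fail early with a readable message instead of a deep traceback. This is a minimal sketch; the function name check_python3 and the message text are hypothetical, not part of Transformers or sacremoses.

```python
import sys


def check_python3(version_info=sys.version_info):
    """Return an error message when running on Python 2, otherwise None."""
    if version_info[0] < 3:
        return "This script needs Python 3; run it with python3 instead of python."
    return None


# Place this before the heavy imports so the message appears first.
message = check_python3()
if message:
    sys.exit(message)
```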

Python 2.7: importing unicode_literals from __future__ gives UnicodeDecodeError while reading a file with umlauts

I have a Python script which reads an input file "myfile.in" containing German umlauts (äöü) and writes it back out. I use Python 2.7. Here is a reduced version of my script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
if __name__=='__main__':
    with open("myfile.in", "r") as f:
        lines = f.readlines()
    txt = ""
    for line in lines:
        txt = txt + line
    with open("myfile.out", "w") as f:
        f.write(txt)
This works fine.
Now my customer requires me to use future statement definitions, so I added the following line to my Python script:
from __future__ import unicode_literals
Now I get the following error message:
Traceback (most recent call last):
  File "myscript.py", line 9, in <module>
    txt = txt + line
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 23: ordinal not in range(128)
How can I resolve this problem?
Thanks for your hints, Thomas
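With unicode_literals, txt = "" becomes a unicode string, so txt + line forces Python 2 to decode the byte string line with the ascii codec, which fails on the first umlaut. One possible way out is to decode the file explicitly when reading it. The sketch below uses io.open, which behaves the same on Python 2.7 and Python 3. The function name copy_text and the latin-1 default are assumptions: byte 0xe4 is "ä" in Latin-1, which suggests the file is not UTF-8, but the real encoding of myfile.in has to be checked.

```python
import io


def copy_text(src_path, dst_path, encoding="latin-1"):
    """Copy a text file line by line, decoding and encoding explicitly."""
    # io.open decodes while reading, so every line is already unicode text.
    with io.open(src_path, "r", encoding=encoding) as f:
        lines = f.readlines()
    txt = u""  # the type that "" has under unicode_literals
    for line in lines:
        txt = txt + line  # unicode + unicode: no implicit ascii decoding
    # io.open encodes while writing, so the output keeps the same encoding.
    with io.open(dst_path, "w", encoding=encoding) as f:
        f.write(txt)
    return txt
```

Calling copy_text("myfile.in", "myfile.out") would then replace the body of the original script.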

Python 2.7 - finding UTF-8 characters

from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
"\xe2\x80\x9c" is the UTF-8 encoding of the left curly double quote (U+201C). When I try to find curly quotes in a website using this code, I get this error:
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2265: ordinal not in range(128)
What does this error mean, what am I doing wrong, and how do I fix it?
You have to call decode('utf-8') on the downloaded page as well, so that web is a unicode string before you replace anything in it.
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read().decode('utf-8')
web = web.replace(b"\xe2\x80\x9c".decode('utf8'), '"')
print(web)
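This happens because Python 2 falls back to the "ascii" codec whenever it has to convert a byte string (str) to unicode implicitly. In Python 3, string literals are unicode and the default source encoding is UTF-8, so you can put the character directly in your code. You can get the same literals in Python 2 with a future import; when the script is saved to a file, Python 2 also needs the coding declaration on the first line because of the non-ASCII literal.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals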
from urllib import urlopen
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.decode("utf-8")
web = web.replace('“' , '"')
print(repr(web))
Note that this is a Python 2 solution. Python 3 handles strings and bytes differently.
I can reproduce the problem with
>>> web = "0123\xe2\x80\x9c789"
>>> web.replace("\xe2\x80\x9c".decode('utf-8'), '"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
You read an encoded byte string into web; I just made a simpler one for the test. When you decoded the search string, you created a unicode object. For the replacement to work, web needs to be converted to unicode as well, and Python 2 attempts that conversion implicitly with the ascii codec.
>>> "\xe2\x80\x9c".decode('utf-8')
u'\u201c'
>>> unicode(web)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
It was the conversion of web that got you. In Python 2, str can hold encoded bytes, and that's exactly what you have here. One option is to just replace the encoded byte sequence:
>>> web.replace("\xe2\x80\x9c", '"')
'0123"789'
This only works because you knew the page was encoded with UTF-8. That is usually the case, but it is worth mentioning.
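The two approaches from the answers can be compared side by side. This is a small sketch written with explicit b"" and u"" prefixes so that it behaves the same on Python 2.7 and Python 3; the sample bytes stand in for the downloaded page and are assumed to be UTF-8.

```python
# Stand-in for the downloaded page: bytes containing a UTF-8 curly quote.
raw = b"0123\xe2\x80\x9c789"

# Option 1: stay in bytes and replace the encoded sequence directly.
fixed_bytes = raw.replace(b"\xe2\x80\x9c", b'"')

# Option 2: decode first, then replace the character U+201C in text.
text = raw.decode("utf-8")
fixed_text = text.replace(u"\u201c", u'"')

# Both routes give the same result once the bytes are decoded.
print(fixed_bytes.decode("utf-8") == fixed_text)
```

The rule in both options is that the haystack and the needle have the same type; mixing str and unicode is what triggers the implicit ascii decoding in Python 2.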

Using pyshark on Python 2.7 raises lxml.etree.XMLSyntaxError

This is a really simple Python script, which runs normally on Linux. But when I moved it to Windows, I got a strange error. I would appreciate some help.
Before running the code, I prepared the environment as follows:
1. Install Microsoft Visual C++ Compiler for python 2.7
2. Install python 2.7.11
3. pip install pyinstaller
4. easy_install pyshark
Below is part of my code.
# -*- coding: utf-8 -*-
from __future__ import print_function
import pyshark
import lxml
import os
def analysis_method(file_name):
    cap = pyshark.FileCapture(input_file=file_name)
    for packet in cap:
        if hasattr(packet, "http"):
            http_layer = packet["http"]
Below is the error information:
Traceback (most recent call last):
  File "packet_offline_analysis.py", line 36, in analysis_method
    for packet in cap:
  File "C:\Python27\lib\site-packages\pyshark-0.3.6.1-py2.7.egg\pyshark\capture\capture.py", line 173, in _packets_from_tshark_sync
    self._get_packet_from_stream(tshark_process.stdout, data, psml_structure=psml_structure))
  File "C:\Python27\lib\site-packages\trollius-1.0.4-py2.7-win32.egg\trollius\base_events.py", line 300, in run_until_complete
    return future.result()
  File "C:\Python27\lib\site-packages\trollius-1.0.4-py2.7-win32.egg\trollius\futures.py", line 287, in result
    raise self._exception
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xD6 0xD0 0xB9 0xFA, line 6, column 58
Uninstall the Chinese version of Wireshark and install an English version instead.
That solves the problem.
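The bytes quoted in the error message hint at why the Wireshark language matters. As a sketch, assuming the localized tshark output is GBK-encoded, the four bytes are not valid UTF-8 but do decode as GBK, to the two characters U+4E2D U+56FD (中国, "China"):

```python
# The four bytes reported in the XMLSyntaxError message.
raw = b"\xd6\xd0\xb9\xfa"

# They are not valid UTF-8, which is what lxml complains about.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Assumption: the localized output uses GBK, under which the bytes decode cleanly.
as_gbk = raw.decode("gbk")

print(valid_utf8)
print(len(as_gbk))
```

An English installation emits only ASCII in those fields, which is also valid UTF-8, so the XML parses.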

Prettify() error using python 2.7

Code:
import urllib2
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.prettify())
Error:
Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 7, in <module>
    print(soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 8775: ordinal not in range(128)
[Finished in 2.4s with exit code 1]
I can't figure out what causes the error. I am using Python 2.7.9.
If your console encoding is ASCII, print converts the unicode string to ASCII, and any character outside the ASCII range raises an exception.
If the console accepts Unicode, everything is displayed correctly. Try this command and run the program again:
export LANG=en_US.UTF-8
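The conversion that print performs can be reproduced without a console. This sketch uses a short sample string containing the same character as in the traceback (u'\xe8'); the string itself is made up for illustration. Encoding to ASCII fails, while encoding to UTF-8 succeeds.

```python
# Sample text with the character from the traceback (U+00E8).
html = u"caf\xe8"

# What an ASCII console forces print to do: this raises UnicodeEncodeError.
try:
    html.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# What a UTF-8 console does instead: every character can be encoded.
utf8_bytes = html.encode("utf-8")

print(ascii_error is not None)
```

In Python 2, printing soup.prettify().encode("utf-8") applies the same explicit encoding when changing the console settings is not an option, although the console must still interpret the bytes as UTF-8 to display them correctly.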