Encoded text
I want to read a list from a file, but the text comes out garbled and calling .encode doesn't help.
import json, sys

with open('your_file.txt') as f:
    lines = f.read().splitlines()
self.logger.info(lines)
self.tts.say(lines[1])
If your file is saved with UTF-8 encoding, this should work:
with open('text.txt', encoding='utf-8', mode='r') as my_file:
    lines = my_file.read().splitlines()
If this doesn't work, your text file is not UTF-8 encoded; put the file's actual encoding in place of utf-8. See also: How to determine the encoding of text?
Or if you share your input file as is, I can figure that out for you.
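When the encoding is unknown, one option is to try a short list of candidates in order. The following is only a minimal sketch, assuming Python 3; the helper name read_lines_with_fallback and the candidate list are made up for illustration and are not part of the original answer. Note that a wrong single-byte encoding can decode without an error and still give wrong characters, so this only catches the obvious failures.

```python
import os
import tempfile


def read_lines_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (lines, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate list is an assumption, adjust it to
    the encodings you actually expect.
    """
    for enc in encodings:
        try:
            with open(path, encoding=enc, mode="r") as f:
                return f.read().splitlines(), enc
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next one
    raise ValueError("none of the candidate encodings could decode %r" % path)


# Demo: a file saved as cp1252 is not valid UTF-8, so the fallback is used.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("f\u00f6rst\nandra\n".encode("cp1252"))
demo_lines, demo_encoding = read_lines_with_fallback(tmp.name)
os.remove(tmp.name)
print(demo_lines, demo_encoding)
```

Putting utf-8 first is deliberate: UTF-8 decoding is strict, so it rarely succeeds by accident on text saved in another encoding.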
Related
I'm trying to save .txt files with utf-8 text inside. There are sometimes emojis or chars like "ü","ä","ö", etc.
Opening the file like this:
with file.open(mode='rb') as f:
    print(f.readlines())
    newMessageObject.attachment = DJFile(f, name=file.name)
    sha256 = get_checksum(attachment, algorithm="SHA256")
    newMessageObject.media_sha256 = sha256
    newMessageObject.save()
    logger.debug(f"[FILE][{messageId}] Added file to database")
readlines() returns bytes, but the file that DJFile creates is not UTF-8 encoded. How can I make it UTF-8?
When I extract my zip file containing a file with Å, Ä or Ö letters,
I get garbage characters.
I'm using Python 2.7.
with zipfile.ZipFile(temp_zip_path.decode('utf-8')) as f:
    for fn in f.namelist():
        extracted_path = f.extract(fn)
zipfile assumes that filenames are encoded as CP437 unless the archive flags them as UTF-8. If the names in your zip file are not stored as Unicode, you need to decode any file or directory names that contain accented letters to see the intended name. But if you try to extract a member using the decoded string, it won't be found, because zipfile looks members up by the original (garbled or not) name.
You could rename the files one by one after extracting but that would be painful.
What you could do is something like this: read the contents and write them on the decoded name.
# -*- coding: utf-8 -*-
import zipfile
import os

temp_zip_path = r'd:\Python_projects\sandbox\cp_437.zip'
temp_zip_path2 = r'd:\Python_projects\sandbox\unicode.zip'
target_loc = os.path.dirname(os.path.realpath(__file__))


def unpack_cp437_or_unicode(archive_path):
    with zipfile.ZipFile(archive_path) as zz:
        for zipped_name in zz.namelist():
            try:
                real_name = zipped_name.decode('cp437')
            except UnicodeEncodeError:
                # Python 2: a name that is already unicode (with non-ASCII
                # characters) fails the implicit ASCII encode, so keep it as-is
                real_name = zipped_name
            with zz.open(zipped_name) as archived:
                contents = archived.read()
            if zipped_name.endswith('/'):
                dirname = os.path.join(target_loc, real_name)
                if not os.path.isdir(dirname):
                    os.makedirs(dirname)
            else:
                with open(os.path.join(target_loc, real_name), 'wb') as target:
                    target.write(contents)


unpack_cp437_or_unicode(temp_zip_path)
unpack_cp437_or_unicode(temp_zip_path2)
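The answer above targets Python 2.7. On Python 3 the situation is slightly different: zipfile already returns str names, decoding them as CP437 when the UTF-8 flag (bit 0x800 of ZipInfo.flag_bits) is not set. The sketch below shows how such a name could be mapped back, under the assumption that the archive really stored its names as UTF-8 without setting the flag. The helper recover_member_name and the actual_encoding parameter are hypothetical names for illustration.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: "names are UTF-8"


def recover_member_name(name, flag_bits, actual_encoding="utf-8"):
    """Undo zipfile's CP437 decoding when the archive used another encoding.

    Hypothetical helper; actual_encoding is a guess the caller must supply.
    """
    if flag_bits & UTF8_FLAG:
        return name  # zipfile already decoded this name as UTF-8
    try:
        # Re-create the raw bytes, then decode them with the real encoding.
        return name.encode("cp437").decode(actual_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # the guess does not fit, keep what zipfile gave us


# Simulate what zipfile would hand back for an unflagged UTF-8 name.
garbled = "\u00c5\u00c4\u00d6.txt".encode("utf-8").decode("cp437")
recovered = recover_member_name(garbled, flag_bits=0)
print(recovered)
```

With a real archive you would call it as recover_member_name(info.filename, info.flag_bits) for each info in ZipFile.infolist(), and still read the member by its original info.filename.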
I'm trying to load a .csv file encoded in UTF-8 and write it out in cp1252 (ANSI) with pipe delimiters. The following code works in Python 3.6, but I need it to work in Python 2.6. However, the 'open' function does not accept an encoding keyword in Python 2.6.
import datetime
import csv

# Define what filenames to read
filenames = ["FILE1", "FILE2"]
infilenames = [filename + ".csv" for filename in filenames]
outfilenames = [filename + "_out_.csv" for filename in filenames]

# Read filenames in utf-8 and write them in cp1252
for infilename, outfilename in zip(infilenames, outfilenames):
    infile = open(infilename, "rt", encoding="utf8")
    reader = csv.reader(infile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    outfile = open(outfilename, "wt", encoding="cp1252")
    writer = csv.writer(outfile, delimiter='|', quotechar='"', quoting=csv.QUOTE_NONE, escapechar='\\')
    for row in reader:
        writer.writerow(row)
    infile.close()
    outfile.close()
I tried several solutions:
Not defining encoding. Results in error on certain unicode characters
use io library (io.open instead of open). Results in "Type error: cannot write str to text in text stream".
Does anyone know the correct solution for this in Python 2.X?
There may be some redundant code here, but I got this to work by doing the following:
First, I converted the text to cp1252 using the .decode and .encode functions.
Then I read the CSV from the cp1252-encoded file and wrote it to a new CSV.
...
import datetime
import csv

# Define what filenames to read
filenames = ["FILE1", "FILE2"]
infilenames = [filename + ".csv" for filename in filenames]
outfilenames = [filename + "_out_.csv" for filename in filenames]
midfilenames = [filename + "_mid_.csv" for filename in filenames]

# Iterate over each file
for infilename, outfilename, midfilename in zip(infilenames, outfilenames, midfilenames):
    # Open file and read utf-8 text, then encode in cp1252
    infile = open(infilename, "r")
    infilet = infile.read()
    infilet = infilet.decode("utf-8")
    infilet = infilet.encode("cp1252", "ignore")
    # write cp1252 encoded file
    midfile = open(midfilename, "w")
    midfile.write(infilet)
    midfile.close()
    # read csv with new cp1252 encoding
    midfile = open(midfilename, "r")
    reader = csv.reader(midfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # define output
    outfile = open(outfilename, "w")
    writer = csv.writer(outfile, delimiter='|', quotechar='"', quoting=csv.QUOTE_NONE, escapechar='\\')
    # write output to new csv file
    for row in reader:
        writer.writerow(row)
    print("written file", outfilename)
    infile.close()
    midfile.close()
    outfile.close()
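One detail of the code above that is easy to miss is the "ignore" argument to .encode: characters that cp1252 cannot represent are dropped silently. The short sketch below isolates that transcoding step; the sample string is made up for illustration, and the snippet uses only escapes and byte literals so it should behave the same on Python 2.6+ and Python 3.

```python
# Sample text: "café λ €". The lambda has no cp1252 equivalent.
text = u"caf\u00e9 \u03bb \u20ac"

utf8_bytes = text.encode("utf-8")     # what the UTF-8 input file contains
decoded = utf8_bytes.decode("utf-8")  # back to a Unicode string

# "ignore" drops unencodable characters, "replace" substitutes "?"
ignored = decoded.encode("cp1252", "ignore")
replaced = decoded.encode("cp1252", "replace")

print(repr(ignored))
print(repr(replaced))
```

If losing characters without any trace is not acceptable, "replace" (or the default "strict", which raises an error) may be a safer choice than "ignore".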
The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.
This is evident when the filename contains spaces or non-ASCII characters.
For example, I have uploaded a file with the following name to S3:
my file řěąλλυ.txt
The notification is received as:
{
    "Records": [
        {
            "s3": {
                "object": {
                    "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
                }
            }
        }
    ]
}
I've tried to decode using:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')
but that yields:
my file ÅÄÄλλÏ.txt
Of course, when I then try to get the file from S3 using Boto, I get a 404 error.
tl;dr
You need to convert the URL-encoded Unicode string to a byte str before unquoting it and decoding it as UTF-8.
For example, for an S3 object with the filename my file řěąλλυ.txt:
# Encode the Unicode string to a UTF-8 [byte] str. The key shouldn't contain
# any non-ASCII at this point, but UTF-8 is safer.
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
>>> utf8_urlencoded_key
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'
# The URL-escaped UTF-8 is now converted to UTF-8 bytes.
# If you passed a Unicode object to unquote_plus, you'd get a
# Unicode string containing UTF-8 encoded bytes!
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
>>> key_utf8
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'
# Decode key_utf8 to a Unicode string. Note the u prefix: the UTF-8 bytes
# have been decoded to Unicode code points.
>>> key = key_utf8.decode('utf-8')
>>> key
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
>>> type(key)
<type 'unicode'>
>>> print(key)
my file řěąλλυ.txt
Background
AWS have committed the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should've got from your decode() is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
The value of key is a Unicode string. In Python 2.x you can call .decode() on a Unicode string, even though it doesn't make sense: Python first tries to encode it to a [byte] str and then decodes that using the given encoding. In Python 2.x the default encoding should be ASCII, which of course can't represent the characters used.
Had you got the proper UnicodeEncodeError from Python, you might have found suitable answers. On Python 3, you wouldn't have been able to call .decode() at all.
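The garbled result from the question can be reproduced on Python 3 as well, which may make the mechanism easier to see. This is only an illustrative sketch, assuming Python 3's urllib.parse: passing encoding="latin-1" to unquote_plus stands in for what happened in Python 2, where every UTF-8 byte ended up as its own character. The variable names are made up.

```python
from urllib.parse import unquote_plus

escaped_key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Default behaviour: percent-escapes are decoded as UTF-8.
correct = unquote_plus(escaped_key)

# Each escaped byte becomes one character: a str holding UTF-8 "bytes".
mojibake = unquote_plus(escaped_key, encoding="latin-1")

# The damage is reversible: turn the characters back into bytes and
# decode them properly.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(correct)
print(repaired == correct)
```

The six non-ASCII characters each take two bytes in UTF-8, so the garbled string is six characters longer than the correct one.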
Just in case someone else comes here hoping for a JavaScript solution, here's what I ended up with:
function decodeS3EventKey (key = '') {
  return decodeURIComponent(key.replace(/\+/g, ' '))
}
With limited testing, it seems to work fine:
test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg decodes to test image ütf テスト.jpg
my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt decodes to my file řěąλλυ.txt
For Python 3:
from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)
# prints input/пустой.pdf
So I have this simple Python function:
def ReadFile(FilePath):
    with open(FilePath, 'r') as f:
        FileContent = f.readlines()
    return FileContent
This function is generic and is used to open all sorts of files. However, when the file being opened is a binary file, the function does not perform as expected. Changing the open() call to:
    with open(FilePath, 'rb') as f:
solves the issue for binary files (and seems to remain valid for text files as well).
Question:
Is it safe and recommended to always use rb mode for reading a file?
If not, what are the cases where it is harmful?
If not, how do you know which mode to use if you don't know what type of file you're working with?
Update
FilePath = r'f1.txt'

def ReadFileT(FilePath):
    with open(FilePath, 'r') as f:
        FileContent = f.readlines()
    return FileContent

def ReadFileB(FilePath):
    with open(FilePath, 'rb') as f:
        FileContent = f.readlines()
    return FileContent

with open("Read_r_Write_w", 'w') as f:
    f.writelines(ReadFileT(FilePath))
with open("Read_r_Write_wb", 'wb') as f:
    f.writelines(ReadFileT(FilePath))
with open("Read_b_Write_w", 'w') as f:
    f.writelines(ReadFileB(FilePath))
with open("Read_b_Write_wb", 'wb') as f:
    f.writelines(ReadFileB(FilePath))
where f1.txt is:
line1
line3
Files Read_b_Write_wb, Read_r_Write_wb & Read_r_Write_w are equal to the source f1.txt.
File Read_b_Write_w is:
line1
line3
In the Python 2.7 Tutorial:
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
My takeaway from that is that using 'rb' seems to be the best practice, and it looks like you ran into the problem they warn about: opening a binary file with 'r' on Windows.
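The tutorial quoted above is for Python 2.7. As a rough illustration of the same text-versus-binary difference, here is a small sketch that assumes Python 3, where text mode translates line endings on every platform and the two modes also return different types. The temporary file and variable names are made up for the demo.

```python
import os
import tempfile

# Write a small file with Windows-style line endings, byte for byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line1\r\nline3\r\n")

# Binary mode returns the bytes exactly as stored.
with open(tmp.name, "rb") as f:
    binary_lines = f.readlines()

# Text mode decodes to str and translates "\r\n" to "\n" while reading.
with open(tmp.name, "r") as f:
    text_lines = f.readlines()

os.remove(tmp.name)

print(binary_lines)
print(text_lines)
```

So 'rb' is the mode that leaves the data untouched, at the price of getting bytes rather than text; for text files you still have to decide how to decode them.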