I'm trying to save .txt files with utf-8 text inside. There are sometimes emojis or chars like "ü","ä","ö", etc.
Opening the file like this:
with file.open(mode='rb') as f:
    print(f.readlines())
    newMessageObject.attachment = DJFile(f, name=file.name)
    sha256 = get_checksum(attachment, algorithm="SHA256")
    newMessageObject.media_sha256 = sha256
    newMessageObject.save()
    logger.debug(f"[FILE][{messageId}] Added file to database")
readlines() returns bytes, but the file that is created with DJFile is not UTF-8 encoded. How can I make sure it is?
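One way to approach this is to normalise the bytes to UTF-8 before handing them to Django. The sketch below is stdlib-only and rests on assumptions: `to_utf8_bytes` and `sha256_hex` are hypothetical helpers (not the `get_checksum` from the question), and the source encoding has to be known or guessed by the caller. The resulting bytes could then be wrapped in `django.core.files.base.ContentFile(data, name=file.name)` in place of `DJFile(f, ...)`.

```python
import hashlib


def to_utf8_bytes(raw, source_encoding="utf-8"):
    """Decode raw bytes with the assumed source encoding and re-encode as UTF-8."""
    # Raises UnicodeDecodeError if the assumed source encoding is wrong.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes object (hypothetical stand-in for get_checksum)."""
    return hashlib.sha256(data).hexdigest()


# Example: bytes that arrived as Latin-1 are stored as UTF-8 instead.
latin1_bytes = "Grüße".encode("latin-1")
utf8_bytes = to_utf8_bytes(latin1_bytes, source_encoding="latin-1")
checksum = sha256_hex(utf8_bytes)
```

Computing the checksum on the same bytes that get stored keeps the digest and the saved attachment consistent.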
Encoded text
I want to read a list from a file, but the text comes out garbled and .encode doesn't really work.
import json,sys
with open('your_file.txt') as f:
    lines = f.read().splitlines()
self.logger.info(lines)
self.tts.say(lines[1])
If your file is saved with UTF-8 encoding, this should work:
with open('text.txt', encoding='utf-8', mode='r') as my_file:
    lines = my_file.read().splitlines()
If this doesn't work, your text file is not UTF-8 encoded. Put your file's actual encoding in place of utf-8; see "How to determine the encoding of text?".
Or if you share your input file as is, I can figure that out for you.
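If the encoding is not known in advance, one option is to try a short list of candidates in order. This is only a sketch: `read_lines` is a hypothetical helper and the candidate list (`utf-8`, then `cp1252`) is an assumption that should be adapted to where the file comes from.

```python
def read_lines(path, encodings=("utf-8", "cp1252")):
    """Return the file's lines, decoded with the first candidate encoding that fits."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding).splitlines()
        except UnicodeDecodeError:
            # This candidate cannot decode the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings matched: %r" % (encodings,))
```

Putting UTF-8 first matters, because single-byte encodings such as cp1252 will accept most UTF-8 input and silently produce garbled text.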
When I extract a zip file containing a file with Å, Ä or Ö in its name, I get garbage characters.
I'm using Python 2.7.
with zipfile.ZipFile(temp_zip_path.decode('utf-8')) as f:
    for fn in f.namelist():
        extracted_path = f.extract(fn)
zipfile assumes that filenames are encoded in CP437. If the names in your zip file are not stored as Unicode, you need to decode the file and directory names that contain accented letters in order to see them correctly. But if you try to extract an entry using the decoded string, it won't be found, because zipfile looks entries up by their original (garbled or not) name.
You could rename the files one by one after extracting, but that would be painful.
Instead, you could read each entry's contents and write them out under the decoded name:
# -*- coding: utf-8 -*-
import zipfile
import os
temp_zip_path = r'd:\Python_projects\sandbox\cp_437.zip'
temp_zip_path2 = r'd:\Python_projects\sandbox\unicode.zip'
target_loc = os.path.dirname(os.path.realpath(__file__))
def unpack_cp437_or_unicode(archive_path):
    with zipfile.ZipFile(archive_path) as zz:
        for zipped_name in zz.namelist():
            try:
                # Python 2: a byte-string name is decoded from CP437
                real_name = zipped_name.decode('cp437')
            except UnicodeEncodeError:
                # The name is already unicode, so keep it as it is
                real_name = zipped_name
            with zz.open(zipped_name) as archived:
                contents = archived.read()
            if zipped_name.endswith('/'):
                # Directory entry: create it under the decoded name
                dirname = os.path.join(target_loc, real_name)
                if not os.path.isdir(dirname):
                    os.makedirs(dirname)
            else:
                with open(os.path.join(target_loc, real_name), 'wb') as target:
                    target.write(contents)
unpack_cp437_or_unicode(temp_zip_path)
unpack_cp437_or_unicode(temp_zip_path2)
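On Python 3 the situation is slightly different, because zipfile has already decoded unflagged names as CP437 by the time you see them. A minimal sketch of undoing that guess follows; `recover_name` is a hypothetical helper, and `real_encoding` is an assumption about how the archive's creator actually encoded the names.

```python
import zipfile


def recover_name(info, real_encoding="utf-8"):
    """Undo zipfile's CP437 guess for one ZipInfo entry (Python 3)."""
    if info.flag_bits & 0x800:
        # Bit 11 set: the name was stored as UTF-8 and is already correct.
        return info.filename
    # Go back to the bytes stored in the archive, then decode them properly.
    raw = info.filename.encode("cp437")
    return raw.decode(real_encoding)
```

As in the answer above, entries still have to be opened by their original `info.filename`; only the name used on disk changes.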
I'm trying to load a .csv file containing UTF-8 text and write it out in cp1252 (ANSI) format with pipe delimiters. The following code works in Python 3.6, but I need it to work in Python 2.6. However, the 'open' function does not accept an encoding keyword in Python 2.6.
import datetime
import csv
# Define what filenames to read
filenames = ["FILE1","FILE2"]
infilenames = [filename+".csv" for filename in filenames]
outfilenames = [filename+"_out_.csv" for filename in filenames]
# Read filenames in utf-8 and write them in cp1252
for infilename,outfilename in zip(infilenames,outfilenames):
    infile = open(infilename, "rt",encoding="utf8")
    reader = csv.reader(infile,delimiter=',',quotechar='"',quoting=csv.QUOTE_MINIMAL)
    outfile = open(outfilename, "wt",encoding="cp1252")
    writer = csv.writer(outfile, delimiter='|', quotechar='"', quoting=csv.QUOTE_NONE,escapechar='\\')
    for row in reader:
        writer.writerow(row)
    infile.close()
    outfile.close()
I tried several solutions:
Not defining an encoding. This results in an error on certain Unicode characters.
Using the io library (io.open instead of open). This results in "Type error: cannot write str to text in text stream".
Does anyone know the correct solution for this in Python 2.x?
There may be some redundant code here, but I got this to work by doing the following:
First I converted the text to cp1252 using the .decode and .encode functions.
Then I read the csv from the cp1252-encoded file and wrote it to a new csv.
...
import datetime
import csv
# Define what filenames to read
filenames = ["FILE1","FILE2"]
infilenames = [filename+".csv" for filename in filenames]
outfilenames = [filename+"_out_.csv" for filename in filenames]
midfilenames = [filename+"_mid_.csv" for filename in filenames]
# Iterate over each file
for infilename,outfilename,midfilename in zip(infilenames,outfilenames,midfilenames):
    # Open file and read utf-8 text, then encode in cp1252
    infile = open(infilename, "r")
    infilet = infile.read()
    infilet = infilet.decode("utf-8")
    infilet = infilet.encode("cp1252","ignore")
    #write cp1252 encoded file
    midfile = open(midfilename,"w")
    midfile.write(infilet)
    midfile.close()
    # read csv with new cp1252 encoding
    midfile = open(midfilename,"r")
    reader = csv.reader(midfile,delimiter=',', quotechar='"',quoting=csv.QUOTE_MINIMAL)
    # define output
    outfile = open(outfilename, "w")
    writer = csv.writer(outfile, delimiter='|', quotechar='"',quoting=csv.QUOTE_NONE,escapechar='\\')
    #write output to new csv file
    for row in reader:
        writer.writerow(row)
    print("written file",outfilename)
    infile.close()
    midfile.close()
    outfile.close()
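The re-encoding step on its own can be written without text-mode files, which avoids any newline translation. The sketch below covers only that step (the csv rewrite would still follow as above); `transcode_file` is a hypothetical helper, and `errors="replace"` is an assumed policy that writes `?` where the answer's `"ignore"` drops the character. It relies on `io.open`, which exists from Python 2.6 on, so it is intended to run on both Python 2 and 3.

```python
import io


def transcode_file(src_path, dst_path, src_encoding="utf-8",
                   dst_encoding="cp1252", errors="replace"):
    """Re-encode a whole file in one pass, working on bytes only."""
    with io.open(src_path, "rb") as src:
        text = src.read().decode(src_encoding)
    with io.open(dst_path, "wb") as dst:
        # errors="replace" writes '?' for characters cp1252 cannot represent.
        dst.write(text.encode(dst_encoding, errors))
```

For example, `transcode_file("FILE1.csv", "FILE1_mid_.csv")` would produce the intermediate cp1252 file used in the answer.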
I use Python 2.7 on Windows 7 Pro SP1.
I tried this code:
import os
path = "E:/data/keyword"
os.chdir(path)
files = os.listdir(path)
query = "{keyword} AND NOT("
result = open("query.txt", "w")
for file in files:
    if file.endswith(".txt"):
        file_path = file.name
        dane = open(file_path, "r")
        query.append(dane)
        result.append(" OR ")
result.write(query)
result.write(")")
result.close()
I get this error:
file_path = file.name AttributeError: 'str' object has no attribute 'name'
I can't figure out why.
I get a second error when the path contains Polish diacritical characters like "ąęłńóżć". The error occurs for:
path = "E:/Bieżące projekty/keyword"
I tried to fix it by changing it to:
path =u"E:/Bieżące projekty/keyword"
but that doesn't help. I'm just starting with Python and I can't find out why this code is not working.
What I want
Find all text files in the directory.
Join all the text files into one text file named "query.txt".
For example:
file 1
data1 data2
file 2
data 3 data 4
Output in "query.txt":
data1 data2 data 3 data 4
The code above works fine when the path variable contains no Polish diacritical characters. When I change the path I get this error:
SyntaxError: Non-ASCII character '\xc5' in file query.py on line 9, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
In PEP 263 I found the magic comment. The standard encoding for Polish characters like "ąęłńóźżć" is ISO-8859-2, so I tried adding that encoding declaration to the code. I tried UTF-8 too and got the same error. My whole code is (without the first 5 lines, which are comments describing what the code does):
import os
#path = r"E:/data"
# -*- coding: iso-8859-2 -*-
path = r"E:/Bieżące przedsięwzięcia"
os.chdir(path)
files = os.listdir(path)
query = "{keyword} AND NOT("
for file in files:
    if file.endswith(".txt"):
        dane = open(file, "r")
        text = dane.read()
        query += text
        print(query)
        dane.close()
        query.join(" OR ")
result = open("query.txt", "w")
result.write(query)
result.write(")")
result.close()
In a Unicode/UTF-8 character table I found that the Polish character "ż" is encoded in UTF-8 as "\xc5\xbc". Commenting out the line that sets the path containing "ż" with # produces the error too. When I remove the line with this character:
path = r"E:/Bieżące przedsięwzięcia"
the code works fine and I get the result I want.
For editing I use Notepad++ with default settings. The only thing I changed is that tabs are replaced by four spaces in Python code.
Second question
I tried to find in the Python docs what the r before the path value means. I can't find it in the Python 2.7 string documentation. Could someone tell me what this part of Python (a u or r before a string value) is called? For example:
path = u"somedata"
path = r"somedata"
I would like to find the documentation and read about it.
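For the goal stated in the question (joining every .txt file in a directory into query.txt), a minimal sketch could look like the one below. `join_text_files` is a hypothetical helper, and the UTF-8 file encoding and the single-space separator are assumptions. Two facts are relevant to the errors described: PEP 263 only honours the `# -*- coding: ... -*-` comment on the first or second line of the file, and the `u` and `r` in front of a string are called string literal prefixes (unicode and raw, respectively).

```python
# -*- coding: utf-8 -*-
import io
import os


def join_text_files(directory, output_name=u"query.txt",
                    separator=u" ", encoding="utf-8"):
    """Concatenate every .txt file in `directory` into `output_name`."""
    parts = []
    for name in sorted(os.listdir(directory)):
        # os.listdir returns plain names, so use them directly (no .name attribute).
        if not name.endswith(u".txt") or name == output_name:
            continue
        with io.open(os.path.join(directory, name), "r", encoding=encoding) as f:
            parts.append(f.read().strip())
    joined = separator.join(parts)
    with io.open(os.path.join(directory, output_name), "w", encoding=encoding) as out:
        out.write(joined)
    return joined
```

Called as `join_text_files(u"E:/Bieżące projekty/keyword")`, it would write the combined text next to the input files; passing a unicode path makes `os.listdir` return unicode names on Python 2 as well.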
The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.
This is evident when the filename contains spaces or non-ASCII characters.
For example, I have uploaded a file with the following name to S3:
my file řěąλλυ.txt
The notification is received as:
{
    "Records": [
        {
            "s3": {
                "object": {
                    "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
                }
            }
        }
    ]
}
I've tried to decode using:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')
but that yields:
my file ÅÄÄλλÏ.txt
Of course, when I then try to get the file from S3 using Boto, I get a 404 error.
tl;dr
You need to convert the URL-encoded Unicode string to a byte str before unquoting it and decoding it as UTF-8.
For example, for an S3 object with the filename: my file řěąλλυ.txt:
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
# encodes the Unicode string to utf-8 encoded [byte] string. The key shouldn't contain any non-ASCII at this point, but UTF-8 will be safer.
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
# the previous url-escaped UTF-8 are now converted to UTF-8 bytes
# If you passed a Unicode object to unquote_plus, you'd have got a
# Unicode with UTF-8 encoded bytes!
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'
# Decodes key_utf8 to a Unicode string
>>> key = key_utf8.decode('utf-8')
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
# Note the u prefix. The utf-8 bytes have been decoded to Unicode points.
>>> type(key)
<type 'unicode'>
>>> print(key)
my file řěąλλυ.txt
Background
AWS have committed the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should've got from your decode() is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
The value of key is a unicode object. In Python 2.x you can call decode() on a unicode object, even though it doesn't make sense: Python first tries to encode it to a [byte] str and then decodes that using the given encoding. In Python 2.x the default encoding should be ASCII, which of course can't represent the characters used.
Had you got the proper UnicodeEncodeError from Python, you might have found suitable answers. On Python 3, you wouldn't have been able to call .decode() on a str at all.
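The garbling itself can be reproduced on Python 3, which may make the mechanism easier to see. This is only an illustration, not the Python 2 behaviour itself: passing `encoding="latin-1"` to `unquote_plus` is a stand-in for "each unescaped UTF-8 byte becomes its own character".

```python
from urllib.parse import unquote_plus

escaped_key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Treating each unescaped byte as its own character reproduces the garbled key.
garbled = unquote_plus(escaped_key, encoding="latin-1")

# Turning those characters back into bytes and decoding them as UTF-8 repairs it.
repaired = garbled.encode("latin-1").decode("utf-8")

# UTF-8 is the default encoding for unquote_plus on Python 3.
correct = unquote_plus(escaped_key)
```

Here `repaired` and `correct` are the same string, while `garbled` contains one character per UTF-8 byte.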
Just in case someone else comes here hoping for a JavaScript solution, here's what I ended up with:
function decodeS3EventKey (key = '') {
  return decodeURIComponent(key.replace(/\+/g, ' '))
}
With limited testing, it seems to work fine:
test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg decodes to test image ütf テスト.jpg
my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt decodes to my file řěąλλυ.txt
For Python 3:
from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)
# prints: input/пустой.pdf
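Applied to a whole notification event, the same call could be wrapped as below. This is a sketch for Python 3: `object_keys` is a hypothetical helper, and the event is assumed to have the `Records[...]['s3']['object']['key']` layout shown in the question.

```python
from urllib.parse import unquote_plus


def object_keys(event):
    """Return the decoded object key of every record in an S3 notification event."""
    return [
        unquote_plus(record["s3"]["object"]["key"])
        for record in event.get("Records", [])
    ]


sample_event = {
    "Records": [
        {"s3": {"object": {"key": "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"}}}
    ]
}
keys = object_keys(sample_event)
```

The decoded keys are what would then be passed on when fetching the objects from S3.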