Text files uploaded to S3 are encoded strangely? - amazon-web-services

This is the strangest error, and I don't even know where to start understanding what's wrong.
S3 had been working well until suddenly, one day (yesterday), it started garbling any uploaded text file that contains non-English characters. Whenever a text file contains Å, Ä, Ö or any other non-ASCII UTF-8 characters, the file gets messed up. I've tried uploading with various clients as well as the AWS web interface. The upload goes fine, but then I download the file and it's messed up. I've tried downloading it to my Mac, and I've tried downloading it onto a Raspberry Pi running Linux. Same error.
Is there any encoding done by Amazon's S3 servers?!

I had the same problem and I solved it by adding charset=utf-8 in properties -> metadata of the file

You can explicitly set "Content-Type: text/plain; charset=utf-8" on the file in the S3 console.
This tells S3 to serve the file as UTF-8 text.
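If you'd rather fix existing objects programmatically, a boto3 sketch along these lines should work (bucket and key names are placeholders); copying an object onto itself with MetadataDirective="REPLACE" rewrites its metadata in place:
import boto3

s3 = boto3.client("s3")
# Copy the object onto itself, replacing its metadata with a UTF-8 content type.
# "my-bucket" and "notes.txt" are hypothetical names.
s3.copy_object(
    Bucket="my-bucket",
    Key="notes.txt",
    CopySource={"Bucket": "my-bucket", "Key": "notes.txt"},
    ContentType="text/plain; charset=utf-8",
    MetadataDirective="REPLACE",  # without this, S3 keeps the old Content-Type
)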

For those who are using boto3 (Python 3) to upload and are getting strange characters instead of accented letters (as in Portuguese and French, for example), Toni Chaz's and Sony Kadavan's answers gave me the hint for the fix. Adding ";charset=utf-8" to the ContentType argument when calling put_object was enough for the accents to be shown correctly.
content_type="text/plain;charset=utf-8"
bucket_obj.put_object(Key=key, Body=data, ContentType=content_type)

Adding <meta charset="utf-8" /> in the <head> of the .html files fixed the problem for me.

Not sure why, but the answer from Sony Kadavan didn't work in my case.
Rather than:
Content-Type: text/plain; charset=utf-8
I used:
Content-Type: text/html; charset=utf-8
Which seemed to work.

In my case the problem was also on the read side: I was reading the file from the filesystem without specifying UTF-8, so I got the wrong file encoding in S3 until I added
InputStreamReader isr = new InputStreamReader(fileInputStream, "UTF-8");
instead of
InputStreamReader isr = new InputStreamReader(fileInputStream);
Please be aware of this possible problem too.
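The same pitfall applies if you assemble the upload body yourself in Python; a minimal sketch, with a hypothetical path:
# Read with an explicit encoding rather than the platform default,
# so the bytes sent to S3 really are UTF-8. "example.txt" is hypothetical.
with open("example.txt", "r", encoding="utf-8") as f:
    data = f.read()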

If your data includes non-ASCII multibyte characters (such as Chinese or Cyrillic characters), you must load the data to VARCHAR columns. The VARCHAR data type supports four-byte UTF-8 characters, but the CHAR data type only accepts single-byte ASCII characters.
Source: http://docs.aws.amazon.com/redshift/latest/dg/t_loading_unicode_data.html

Related

CFMail attachment filename broken in ColdFusion 2016 on Windows server

I'm moving a site from CF10 on Linux to CF2016 on Windows, and have run into an issue with file attachments in cfmail.
I'm attaching the file in cfmail with:
<cfmailparam file="#FileName#">
and have also tried variations with and without disposition and type like:
<cfmailparam file="#FileName#" disposition="attachment;
filename=""#FileName#"""
type="#ContentType#/#ContentSubType#">
But no matter what, on CF2016 on Windows, my attachment names in Outlook come through as ATT00160.dat (without type set) or ATT00169.xlsx (with type set).
It seems filenames over a certain length cause the issue: a filename of 64 characters will break it, but a shorter filename of, say, 49 characters won't.
Viewing the message source in Outlook, from the cfmail sent from Windows, I see the value below. Notice how the name under Content-Type has been split? (The name*0=/name*1= pieces are RFC 2231 parameter continuations.)
Content-Type: application/octet-stream;
name*0=BLAH_BLAH1_Ownership_Database_Issue_2018-01_In_Development2.;
name*1=xlsx
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename*0=BLAH_BLAH1_Ownership_Database_Is;
filename*1=sue_2018-01_In_Development2.xlsx
The same attachment sent with cfmail, from Linux, gives me:
Content-Type: application/octet-stream;
name=BLAH_BLAH1_Ownership_Database_Issue_2018-01_In_Development2.xlsx
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename*0=BLAH_BLAH1_Ownership_Database_Is;
filename*1=sue_2018-01_In_Development2.xlsx
Note the content-type name has not been split.
Anyone have any ideas on how to fix this issue?
So, I finally sorted it. You need to manually put the file name in the type:
<cfmailparam file="#FileName#"
type="#ContentType#/#ContentSubType#;name=""#FileName#""">
I've never needed to do that previously. I don't know if this is a Windows server, CF2016 or SmarterMail (our mail server) thing, but if you run into the same issue, the above worked for me.
You might also see my report in the Adobe bug tracker. I have had trouble with long file name attachments ever since ColdFusion switched to the newer JavaMail.
https://tracker.adobe.com/#/view/CF-4199784

Requests Gzip HTTP download and write to disk

I'm using the requests library and Python 2.7 to download a gzipped text file from a web API. Using the code below, I'm able to successfully send a GET request and, judging from the headers, receive a response in the form of a gzip file.
I know Requests decompresses these files for you automatically if it detects from the header that the response is gzipped. I wanted to take that download in the form of a file stream and write the contents to disk for storage and future analysis.
When I open the resulting file in my working directory, however, I get characters like this: —}}¶— Q#Ï 'õ
For reference, some of the response headers include 'Content-Encoding': 'gzip', 'Content-Type': 'application/download', 'Accept-Encoding,User-Agent'
Am I wrong to write in binary? Am I not encoding the text correctly (i.e. could it be ASCII vs. UTF-8)? There is no apparent character encoding noted in the response headers.
try:
    response = requests.get(url, paramDict, stream=True)
except Exception as e:
    print(e)

with open(outName, 'wb') as out_file:
    for chunk in response.iter_content(chunk_size=1024):
        out_file.write(chunk)
EDIT 3.30.2016:
Now I've changed my code a little to use the gzipstream library. I tried using the stream to read the entirety of the gzipped text file in my response content:
with open(outName, 'wb') as out_file, GzipStreamFile(response.content) as fileStream:
    streamContent = fileStream.read()
    out_file.write(streamContent)
I then received this error:
out_file.write(streamContent)
AttributeError: '_GzipStreamFile' object has no attribute 'close'
The output was an empty text file with the file name as anticipated. Do I need to initialize my streamContent variable outside of the with block so that it doesn't automatically try to call a close method at the end of the block?
EDIT 4.1.2016: Just thought I'd clarify that this DOES NOT have to be a stream; that was just one solution I encountered. I just want to make a daily request for this gzipped file and have it saved locally in plaintext.
import requests
import zlib

try:
    response = requests.get(url, paramDict)
except Exception as e:
    print(e)

# zlib.MAX_WBITS | 32 tells zlib to auto-detect the gzip/zlib header
data = zlib.decompress(response.content, zlib.MAX_WBITS | 32)
with open('outFileName.txt', 'w') as outFile:
    outFile.write(data)
Here is the code that I wrote that ended up working. It is as sigmavirus said: the file was gzipped to begin with. I knew this fact, but apparently did not describe it clearly enough, as I kept reading/writing the gzipped bytes.
Using the zlib module, I was able to decompress the content of the response all at one time into the data variable; I then wrote that variable containing the decompressed data into a file.
I'm not sure if this is the best or most Pythonic way to do this, but it worked. If anyone can enlighten me as to why I could not gzip.open this content (perhaps I needed to use an alternative method; I tried the gzipstream library to no avail), I would appreciate any explanation, but I do consider this question answered.
Thanks to everyone who helped me, even if you didn't have the solution, you helped encourage me to persevere!
So the combination here of stream=True and iter_content is what is causing your problems. What you might want to do is something akin to this (to preserve the streaming behaviour):
try:
    response = requests.get(url, params=paramDict, stream=True)
except Exception as e:
    print(e)

raw = response.raw
with open(outName, 'wb') as out_file:
    while True:
        chunk = raw.read(1024, decode_content=True)
        if not chunk:
            break
        out_file.write(chunk)
Note that you still want to write bytes: you haven't determined the character encoding of the content, so you still have bytes, but you're no longer dealing with the gzipped bytes.
You are requesting the raw socket stream, which strips off the chunked transfer encoding but leaves the content coding intact. In other words: what you've got there is almost certainly the gzipped content. The presence of the Content-Encoding: gzip header is a strong indicator of that, as HTTP clients are required to remove the header if they remove the content coding.
One way to eliminate this would be to send an empty Accept-Encoding header with the request, to indicate that no content coding is acceptable. If the API is RFC-compliant, you should receive an uncompressed response. The other way is to decompress the stream yourself, e.g. with zlib's decompressobj; the gzipstream lib can also give you a start.
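A rough sketch of the decompressobj route, reusing url and outName from the question (error handling omitted):
import zlib
import requests

response = requests.get(url, stream=True)
decomp = zlib.decompressobj(zlib.MAX_WBITS | 32)  # 32 = auto-detect the gzip header
with open(outName, 'wb') as out_file:
    # Pull raw (still gzipped) bytes off the socket and inflate them ourselves.
    for chunk in response.raw.stream(1024, decode_content=False):
        out_file.write(decomp.decompress(chunk))
    out_file.write(decomp.flush())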

prevent absolute file path in Content-Disposition "filename"

I have a simple HTML Form
<form id="uploadForm" method="post" action="/cgi-bin/test.cgi" enctype="multipart/form-data">
<input type="submit" name="add_something" value="add">
<input size="50" type="file" name="myFile" accept="application/zip">
</form>
In addition I do some web page localization on the server side, by checking the user's browser locale or looking for a self-set language session cookie.
If I upload a file with
Iron 18.0.1050.0
Opera 11.64.1403
Firefox 3.6.27
Firefox 12.0
Google Chrome 19.0.1084.52
SeaMonkey 2.9.1
all works fine. But If I upload a file with
IE 9.0.8112.16421
Maxthon 3.3.8.3000
the localization fails. I detected the issue inside the HTTP request:
Opera 11
Content-Disposition: form-data; name="myFile"; filename="ziptest.zip"
Content-Type: application/zip
and IE 9
Content-Disposition: form-data; name="myFile"; filename="C:\Documents and Settings\m1krsch\Documents\Now Some Spaces\ziptest.zip"
Content-Type: application/x-zip-compressed
If I remove the spaces from the path all works fine in IE and Maxton.
I can neither exchange the cgicc library (it is a fixed part of the project) nor force a user to use a path without spaces. How can I circumvent this issue? Is there a way to force IE/Maxthon to use the filename instead of the absolute filepath? Or can I set a specific parameter in cgi/env to prevent transmission of the absolute filepath?
[EDIT]
I found out that this is a security issue in IE and Maxthon. The security zone model of IE enables "Include local directory path when uploading files" by default. I can disallow this behaviour only by changing the client configuration, but I am still searching for an application-based solution.
[/EDIT]
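One application-based approach is to treat the submitted filename as untrusted and keep only its last path component on the server. A minimal sketch of the idea in Python (the real fix would be the C++/cgicc equivalent):
# Keep only the last path component, whichever separator the browser used.
def client_basename(raw_filename):
    return raw_filename.replace("\\", "/").rsplit("/", 1)[-1]

print(client_basename(r"C:\Documents and Settings\m1krsch\ziptest.zip"))  # ziptest.zip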
Try replacing the spaces with '%20':
"C:\Documents%20and%20Settings\m1krsch\Documents\Now%20Some%20Spaces\ziptest.zip"
I found a stupid error in my localization code. I am using RapidXML for this and encapsulate the whole localization code and the RapidXML headers in one class. Unfortunately I did not read the documentation very carefully. The data inside the vector<char> object holding the XML document data is not copied into the xml_document<> object by the parse() method, as I had expected. This looks like procedural C code to me and is, in my opinion, bad OOD. The documentation says:
3.1 Lifetime Of Source Text
In-situ parsing requires that source text lives at least as long as the document object.
The problem vanished when I corrected my code to keep a long-lived vector<char> object inside my localization class.
Nevertheless I am perplexed that almost all other browsers had no problem with my old code.

How to retrieve codepage from cURL HTTP response?

I'm using lib-cURL as a HTTP client to retrieve various pages (can be any URL for that matter).
Usually the data comes as a UTF-8 string and then I just call "MultiByteToWideChar" and it works well.
However, some web pages still use code-page encodings, and I see gibberish if I try to convert those pages to UTF-8.
Is there an easy way to retrieve the code page from the data? Or will I have to scan it manually (for "encoding=") and then translate accordingly?
If so, how do I get the code-page ID from its name (Code Page Identifiers)?
Thanks,
Omer
There are several locations where a document can state its encoding:
the Content-Type HTTP header
the (optional) XML declaration
the Content-Type meta tag inside the document header
for HTML5 documents the charset meta tag.
There are probably even more I've forgotten.
In the end, detecting the actual encoding is rather hard. You really shouldn't do this yourself but use high-level libraries for retrieving and parsing HTML content. I'm sure they are available even for C++, even if they have to be lifted from a browser environment. :)
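If you do end up scanning manually, a rough sketch of that fallback chain in Python (the regexes are deliberately simplified; a real HTML parser is more robust):
import re

def guess_charset(headers, body):
    # 1. Content-Type HTTP header, e.g. "text/html; charset=windows-1251"
    m = re.search(r"charset=([\w-]+)", headers.get("Content-Type", ""), re.I)
    if m:
        return m.group(1)
    # 2.-4. XML declaration, Content-Type meta tag, HTML5 charset meta tag;
    # only the start of the body needs to be inspected.
    m = re.search(rb'(?:encoding|charset)\s*=\s*["\']?([\w-]+)', body[:2048], re.I)
    if m:
        return m.group(1).decode("ascii")
    return "utf-8"  # last-resort guess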
I used DetectInputCodepage in the IMultiLanguage2 interface and it worked great!

I can not see utf-8 strings on django powered by tornado server

I have a Django server running on a Tornado server.
When I use special characters like ó or ñ, a certain part of a certain Django template is not rendered (the character set has been specified as 'utf-8' in settings.py, and tornado_script.py starts with # -*- coding: utf-8 -*-).
Considering that just a certain part of the template (a form) is not rendered well, and that the server works perfectly using the Django built-in runserver, I suppose the problem comes from the Tornado server, but I cannot debug that configuration.
If any of you know how to debug this to find the missing configuration, please let me know.
I've been searching a lot for the last 3 hours with no results.
Best Regards
Probably your browser is guessing the character set wrong. Some browsers allow you to set the encoding manually; I would suggest setting it to UTF-8. If this is indeed the problem, you can set the encoding in a meta tag so all browsers will always pick the right encoding. Add this to the head:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
You should also make sure that your special characters really are in UTF-8. Most editors allow you to enforce this. You can also declare a specific encoding for your Python files, which will make the interpreter choke if anything else appears. Add the following to the beginning of any Python source with weird characters:
# encoding: utf-8
I've found that Tornado's "render" for templates likes to do its own encoding, which may be messing things up for you.
You can look at their source code to see exactly what it does.
Try using "write" instead and see whether the characters appear in the output; then you may have a better idea of what's happening.
J