boto3 s3 download file, key arg seems redundant to filename

I'm trying to understand why there need to be two arguments here:
the Key argument, which is basically the filename
the Filename argument, the local path where the file should be saved
Isn't it redundant to still pass the key?
Can't we just pass the bucket and the filename?

A key in S3 can be a long string, like /my-prefix/YYYY/MM/DD/UUID.txt. It can, and usually will, contain things like slash characters. So it makes sense to have to specify the local Filename argument separately: you may not want to save the file under the same path it has in S3, and you also may not want to save it under the same name that was used in S3.

You can download the file to a Filename that has a different name than the Key you are downloading from, e.g. Key = 'hello.txt', Filename = '/tmp/MylocalCopyOfTheHello.txt'.
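For reference, a minimal sketch of that call with boto3 (the bucket name here is made up for illustration):

import boto3

# Minimal sketch: download s3://my-bucket/hello.txt to a differently named local file.
# 'my-bucket' is a placeholder; the Key and Filename values echo the example above.
s3 = boto3.client('s3')
s3.download_file(
    Bucket='my-bucket',
    Key='hello.txt',                            # name of the object inside the bucket
    Filename='/tmp/MylocalCopyOfTheHello.txt',  # local path (and name) to save it under
)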

Related

Zip file contents in AWS Lambda

We have a function that gets the list of files in a zip file, and it works standalone and in Lambda until the file is larger than 512 MB.
The function needs to get a list of files in the zip file and read the contents of a JSON file that should be in the zip file.
This is part of the function:
try:
    s3_object = s3_client.get_object(Bucket=bucketname, Key=filename)
    #s3_object = s3_client.head_object(Bucket=bucketname, Key=filename)
    #s3_object = s3_resource.Object(bucket_name=bucketname, key=filename)
except:
    return ('NotExist')
zip_file = s3_object['Body'].read()
buffer = io.BytesIO(zip_file)
# buffer = io.BytesIO(s3_object.get()['Body'].read())
with zipfile.ZipFile(buffer, mode='r', allowZip64=True) as zip_files:
    for content_filename in zip_files.namelist():
        zipinfo = zip_files.getinfo(content_filename)
        if zipinfo.filename[:2] != '__':
            no_files += 1
            if zipinfo.filename == json_file:
                json_exist = True
                with io.TextIOWrapper(zip_files.open(json_file), encoding='utf-8') as jsonfile:
                    object_json = jsonfile.read()
The get_object call is the issue, as it loads the whole file into memory, and obviously the larger the file, the more likely it is to exceed the memory available to the Lambda function.
I've tried using head_object, but that only gives me the metadata for the file, and I don't know how to get the list of files in the zip when using head_object or resource.Object.
I would be grateful for any ideas please.
It would likely be the .read() operation that consumes the memory.
So, one option is to simply increase the memory allocation given to the Lambda function.
Or, you can download_file() the zip file, but Lambda functions are only given 512MB in the /tmp/ directory for storage, so you would likely need to mount an Amazon EFS filesystem for additional storage.
Or you might be able to use smart-open (on PyPI) to read the contents of the zip file directly from S3 -- it knows how to use open() with files in S3 and also with zip files.
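For example, a rough sketch of the smart-open approach (the s3:// URI is a placeholder, and this assumes the Lambda role can read the object; it may still buffer parts of the archive, so test with your real file sizes):

import zipfile
from smart_open import open as s3_open

# Stream the zip from S3 rather than calling .read() on the whole body first.
# 's3://my-bucket/archive.zip' is a placeholder URI.
with s3_open('s3://my-bucket/archive.zip', 'rb') as s3_file:
    with zipfile.ZipFile(s3_file) as zf:
        print(zf.namelist())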
Not a full answer, but too long for comments.
It is possible to download only part of a file from S3, so you should be able to grab only the list of files and parse that.
The zip file format places the list of archived files at the end of the archive, in a Central directory file header.
You can download part of a file from S3 by specifying a range to the GetObject API call. In Python, with boto3, you would pass the range as a parameter to the get_object() s3 client method, or the get() method of the Object resource.
So, you could read pieces from the end of the file in 1 MiB increments until you find the header signature (0x02014b50), then parse the header and extract the file names. You might even be able to trick Python into thinking it's a proper .zip file and convince it to give you the list while providing only the last piece(s). An elegant solution that doesn't require downloading huge files.
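To make that concrete, here is a hedged sketch (bucket and key names are placeholders) that fetches only the last 1 MiB of the object and searches it for the central directory signature:

import boto3

s3 = boto3.client('s3')

# Find the object size, then request only its final 1 MiB.
head = s3.head_object(Bucket='my-bucket', Key='archive.zip')
size = head['ContentLength']
start = max(0, size - 1024 * 1024)

tail = s3.get_object(
    Bucket='my-bucket',
    Key='archive.zip',
    Range='bytes={}-{}'.format(start, size - 1),
)['Body'].read()

# The signature 0x02014b50 is stored little-endian on disk, i.e. the bytes b'PK\x01\x02'.
print(tail.find(b'PK\x01\x02'))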
Or, it might be easier to ask the uploader to provide a list of files with the archive. Depending on your situation, not everything has to be solved in code :).

Apache Beam/Dataflow - passing file path to ReadFromText

I have a use case where I want to read the filename from a metadata table. I have written a pipeline function to read the metadata table, but I am not sure how I can pass this information to ReadFromText, as it only takes a string as input. Is it possible to assign this value to ReadFromText()? Please suggest some workarounds or ideas for how to achieve this. Thanks.
Code:
pipeline | 'Read from a File' >> ReadFromText(<I want to pass the file path here>,
                                              skip_header_lines=1)
Note: There will be various folders and files in storage, and the files are in CSV format, but in my use case I can't directly pass the storage location or filename to ReadFromText. I want to read it from the metadata table and pass that value. Hope I am clear. Thanks.
I don't understand why you need to read the metadata. If you want to read all the files inside a folder, you can just provide a glob pattern. This works in Python; I'm not sure about Java.
p | ReadFromText("./folder/*.csv")
"*" is the wildcard here, which lets the pipeline read every file matching the .csv pattern. You can also add something at the start of the pattern.
What you want is textio.ReadAllFromText, which reads file patterns from a PCollection instead of taking a string directly.
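A rough sketch of that approach (the Create step stands in for whatever metadata lookup produces the paths, the gs:// path is made up, and skip_header_lines is passed on the assumption that your Beam version's ReadAllFromText accepts it):

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    lines = (
        p
        | 'PathsFromMetadata' >> beam.Create(['gs://my-bucket/folder/file1.csv'])  # placeholder for the metadata lookup
        | 'ReadEachFile' >> ReadAllFromText(skip_header_lines=1)
    )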

Flask - Do I need to use secure_filename() on uploads to S3/Google Cloud?

In the Flask documentation for file uploads, they recommend use of secure_filename() to sanitize a file's name before storing it.
Here's their example:
uploaded_file = request.files['file']
if uploaded_file:
    filename = secure_filename(uploaded_file.filename)  # <<<< note the use of secure_filename() here
    uploaded_file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
    return redirect(url_for('display_file', filename=filename))
The documentation says:
Now the problem is that there is that principle called “never trust user
input”. This is also true for the filename of an uploaded file. All
submitted form data can be forged, and filenames can be dangerous. For
the moment just remember: always use that function to secure a
filename before storing it directly on the filesystem.
With offsite storage (S3 or Google Cloud), I will not be using Flask to store the file on the web server. Instead, I'll rename the upload file (with my own UUID), and then upload it elsewhere.
Example:
blob = bucket.blob('prompts/{filename}'.format(filename=uuid.uuid4()))
blob.upload_from_string(uploaded_file.read(), content_type=uploaded_file.content_type)
Under this scenario, am I right that you do not need to invoke secure_filename() first?
It would seem that because I (a) read the contents of the file into a string and then (b) use my own filename, my use case is not vulnerable to directory traversal or rogue command-type attacks (e.g. "../../../../home/username/.bashrc"), but I'm not 100% sure.
You are correct.
You only need to use the secure_filename function if you are using the value of request.files['file'].filename to build a filepath destined for your filesystem - for example as an argument to os.path.join.
As you're using a UUID for the filename, the user input value is disregarded anyway.
Even without S3, it would also be safe NOT to use secure_filename if you used a UUID as the filename segment of the filepath on your local filesystem. For example:
uploaded_file = request.files['file']
if uploaded_file:
    file_uuid = uuid.uuid4()
    uploaded_file.save(os.path.join(app.config['UPLOAD_FOLDER'], str(file_uuid)))
    # Rest of code
In either scenario you'd then store the UUID somewhere in the database. Whether you store the originally provided request.files['file'].filename value alongside that is your choice.
This might make sense if you want the user to see the original name of the file they uploaded. In that case it's definitely wise to run the value through secure_filename anyway, so there's never a situation where the frontend displays a listing to a user which includes a file called ../../../../ohdear.txt
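For example, a hedged sketch of that pattern built on the question's GCS upload code (request and bucket come from the surrounding app, and save_upload_record is a hypothetical database helper):

import uuid
from werkzeug.utils import secure_filename

uploaded_file = request.files['file']
if uploaded_file:
    blob_name = 'prompts/{}'.format(uuid.uuid4())            # name actually used in storage
    display_name = secure_filename(uploaded_file.filename)   # sanitized name, only ever shown to users

    blob = bucket.blob(blob_name)
    blob.upload_from_string(uploaded_file.read(),
                            content_type=uploaded_file.content_type)

    # save_upload_record(blob_name, display_name)  # hypothetical DB call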
The secure_filename docstring also points out some other functionality:
Pass it a filename and it will return a secure version of it. This
filename can then safely be stored on a regular file system and passed
to :func:os.path.join. The filename returned is an ASCII only string
for maximum portability.
On windows systems the function also makes sure that the file is not
named after one of the special device files.
>>> secure_filename("My cool movie.mov")
'My_cool_movie.mov'
>>> secure_filename("../../../etc/passwd")
'etc_passwd'
>>> secure_filename(u'i contain cool \xfcml\xe4uts.txt')
'i_contain_cool_umlauts.txt'

What is the definition of an Amazon S3 prefix

What exactly is the definition of an S3 prefix?
Let's say I have the following S3 structure:
photos/2006/January/sample.jpg
photos/2006/February/sample2.jpg
photos/2006/February/sample3.jpg
photos/2006/February/sample4.jpg
What will be the prefix for sample.jpg?
Will photos be the prefix, or will the whole path up to sample.jpg (i.e. photos/2006/January/) be the prefix?
I ask because there is a read/write limit for each prefix.
S3 is just an object store, mapping a 'key' to an 'object'. In your case, I see four objects (likely images) with their own keys that are trying to imitate a filesystem's folder structure.
Prefix is referring to any string that would be a prefix to an object's key.
photos/2006/January/sample.jpg is just a key, so any of the following (and more) can be a prefix that would match this key:
pho
photos
photos/2
photos/2006/January/sample.jp
photos/2006/January/sample.jpg
Note that the first three prefixes listed above will be a match for the other keys you mention in your question.
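Where the prefix usually shows up in practice is when listing objects; a minimal boto3 sketch (the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')

# Matches the three February objects but not photos/2006/January/sample.jpg.
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='photos/2006/February/')
for obj in response.get('Contents', []):
    print(obj['Key'])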
You can think of a prefix as a path to a folder. Although they are not really folders, AWS has created prefixes to make it easier for us to visualize our data.
The prefix path is relative to the object. So for sample.jpg the prefix is photos/2006/January/, but if I had a sample2.jpg directly inside photos/2006/, then its prefix would be photos/2006/.

How to download a file from AWS S3 using Python without using the key

I need to download an XML file from AWS S3.
I tried using get_contents_to_filename(fname), and it worked.
But I need to download the file without specifying fname, because if I specify fname, my downloaded file gets saved as fname.
I want to save the file as it is, with its original name.
This is my current code:
k = Key(bucket)
k.get_contents_to_filename(fname)
Can someone please help me download and fetch the file without having to specify the filename myself?
Thanks in advance!
I'm not sure which library you're using, but if k is the AWS key you want to download, then k.name is probably the key name, so k.get_contents_to_filename(k.name) would probably do more or less what you want.
The one problem is that the key name might not be a legal file name, or it may have file path separators. So if the key name were something like '../../../../somepath/somename' the file would be saved somewhere you don't expect. So copy k.name to a string and either sanitize it by changing all dangerous characters to safe ones, or just extract the part of the key name you want to use for the file name.
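As a rough sketch of that idea with boto 2 (to match the question's Key(bucket) style; the bucket name and prefix are placeholders, and credentials are assumed to come from the environment), using only the last segment of each key name as the local filename:

import os
from boto.s3.connection import S3Connection

conn = S3Connection()                       # assumes credentials are configured in the environment
bucket = conn.get_bucket('my-bucket')       # placeholder bucket name

for k in bucket.list(prefix='reports/'):    # placeholder prefix
    local_name = os.path.basename(k.name)   # drops 'folder' parts such as 'reports/' or '../'
    if local_name:                          # skip keys ending in '/', which behave like folders
        k.get_contents_to_filename(local_name)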