S3/Athena query result location and “Invalid S3 folder location” - amazon-web-services

Are there particular requirements for the bucket when specifying the query result location? When I try to create a new table, I get a popup:
Before you run your first query, you need to set up a query result location in Amazon S3. Learn more
So I click the link and specify my query result location in the specified format, s3://query-results-bucket/folder, but it always says:
Invalid S3 folder location
I posted this in Superuser first but it was closed (not sure why...).

The folder name needs to have a trailing slash:
s3://query-results-bucket/folder/
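
For example, a minimal boto3 sketch that passes an output location with the trailing slash (the bucket and prefix names here are placeholders):

import boto3

athena = boto3.client("athena")

# Note the trailing slash on the output location
athena.start_query_execution(
    QueryString="SELECT 1",
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/folder/"},
)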

Ran into this earlier in the week.
First, make sure the bucket exists. There doesn't appear to be an option to create the bucket when setting the value in the Athena console.
Next, make sure you have specified the bucket properly. In my case, I initially had s3:/// - there is no validation, so an extra character will cause this error. If you go to the Athena settings, you can see what the bucket settings look like.
Finally, check the workgroup: there is a default workgroup per account, so make sure it's not disabled. You can create additional workgroups, each of which will need its own settings.
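
If it helps, here is a rough boto3 sketch for checking and updating the default workgroup's result location (this assumes the built-in primary workgroup and a placeholder bucket; adjust the names to your setup):

import boto3

athena = boto3.client("athena")

# Inspect the workgroup's state and current result location
wg = athena.get_work_group(WorkGroup="primary")["WorkGroup"]
print(wg["State"])  # should be ENABLED
print(wg["Configuration"].get("ResultConfiguration", {}).get("OutputLocation"))

# Point the workgroup at a bucket/prefix that already exists, with a trailing slash
athena.update_work_group(
    WorkGroup="primary",
    ConfigurationUpdates={
        "ResultConfigurationUpdates": {
            "OutputLocation": "s3://query-results-bucket/folder/"
        }
    },
)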

Related

AWS S3 file with the same name does not get overwritten but gets characters added at the end of the filename

Below is an example for my scenario,
I have a Django API which allows users to upload images to a certain directory; the images will be stored in an S3 bucket. Let's say the file name is 'example.jpeg'.
User again uploads image with the same name 'example.jpeg' to the same directory.
Both of them correctly show up in the same directory, but the second one gets additional characters at the end of the filename, like this: 'example_785PmrM.jpeg'. I suspect the additional characters are added by S3, but my research says S3 will overwrite a file with the same name.
How can I enable the overwrite feature? I haven't seen any option for this.
Thanks
S3 itself does not change a key on its own. The only option I see that can be impacting this is Django's storage backend for S3:
AWS_S3_FILE_OVERWRITE (optional: default is True)
By default files with the same name will overwrite each other. Set this to False to have extra characters appended.
So, to have new uploads overwrite the existing object instead of getting extra characters appended, set AWS_S3_FILE_OVERWRITE to True (its default).
Depending on your exact needs, consider enabling S3 versioning so you can access previous versions of objects as they're overwritten in the future.
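
As a sketch, assuming django-storages with the S3 backend (the bucket name below is a placeholder):

# settings.py
AWS_S3_FILE_OVERWRITE = True  # uploads with the same key overwrite the existing object

# Optional: enable versioning on the bucket so overwritten objects stay recoverable
import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="my-media-bucket",  # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled"},
)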

Correct naming convention for Cloud Run: gcloud builds submit media bucket names (The specified bucket does not exist)

I am following this tutorial to upload my existing Django project running locally to Google Cloud Run. I believe I have followed all the steps correctly to create the bucket and grant it the necessary permissions. But when I try to run:
gcloud builds submit \
--config cloudmigrate.yaml \
--substitutions=_INSTANCE_NAME=cgps-reg-2-postgre-sql,_REGION=us-central1
I get the error:
Step #3 - "collect static": google.api_core.exceptions.NotFound: 404 POST https://storage.googleapis.com/upload/storage/v1/b/cgps-registration-2_cgps-reg-2-static-files-bucket/o?uploadType=multipart&predefinedAcl=publicRead:
I was a little confused by this line that seems to tell you to put the bucket name in the location field, but I think it's perhaps just a typo in the tutorial. I was not sure if I should leave the location at the default "Multi-Region" or change it to "us-central1", where everything else in the project is.
I interpreted the instructions for telling the project the name of the bucket as PROJECT_ID + "_" + BUCKET_NAME:
or in my case
cgps-registration-2_cgps-reg-2-static-files-bucket
But this naming convention is not correct, as the error clearly says it cannot find a bucket with this name. So what am I missing here?
Credit for this answer really goes to dazwilken. The answer he gave in the comment is the correct one:
Your bucket name is cgps-reg-2-static-files-bucket. This is its globally unique name. You should not prefix it (again) with the project name when referencing it. The error is telling you (correctly) that the bucket (called cgps-registration-2_cgps-reg-2-static-files-bucket) does not exist. It does not. The bucket is called cgps-reg-2-static-files-bucket.
Because bucket names must be unique, one way to create them is to combine another unique name, i.e. the Google Cloud Project ID, in their naming. The tutorial likely confused you by using this approach without explaining it.
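
In other words, reference the bucket by its own name wherever the project asks for it. A sketch of what that might look like in Django settings, assuming the tutorial's django-storages Google Cloud Storage backend (setting names may differ in your setup):

# settings.py
GS_BUCKET_NAME = "cgps-reg-2-static-files-bucket"  # the bucket's own name, no project-ID prefix
DEFAULT_FILE_STORAGE = "storages.backends.gcloud.GoogleCloudStorage"
STATICFILES_STORAGE = "storages.backends.gcloud.GoogleCloudStorage"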

AWS S3 object in folder key mismatch

According to Object Key and Metadata - Amazon Simple Storage Service, Amazon S3 buckets have a flat structure, meaning that an object created as object1.txt inside a folder folder1 would have the key folder1/object1.txt. However, there is a discrepancy between the docs and what the AWS console shows.
When you click on the checkbox next to object1.txt, the properties panel slides in from the right, and there is a key property under the overview section that reads object1.txt. This, according to the documentation, is incorrect. Additionally, if you click on the object link, the new overview screen shows a different panel in which the key is folder1/object1.txt.
My question is: what is the reason for this discrepancy, and which panel is displaying the correct information? Is the key shown in the first panel something entirely different from the S3 object key?
The documentation is correct.
However, since humans enjoy the concept of folders and directories, Amazon S3 provides something called a Common Prefix, which is similar to the concept of a path.
When listing the contents of a bucket, paths (effectively keys without the final "object name") are provided as a list of CommonPrefixes. The AWS Management Console uses this to allow users to step through folder hierarchies.
However, the Key of every object includes its full path.
Here's something interesting... if a user clicks "New Folder" in the Amazon S3 management console, then a zero-length file is created with the name of the folder. This causes the folder to appear as a common prefix, even if no files exist "inside" the folder.
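
You can see both views with boto3 (bucket and key names below are placeholders): listing with a Delimiter returns CommonPrefixes, while a plain listing returns full keys:

import boto3

s3 = boto3.client("s3")

# With a delimiter, "folders" come back as CommonPrefixes
resp = s3.list_objects_v2(Bucket="my-bucket", Delimiter="/")
print([p["Prefix"] for p in resp.get("CommonPrefixes", [])])  # e.g. ['folder1/']

# Without a delimiter, every object is listed with its full key
resp = s3.list_objects_v2(Bucket="my-bucket")
print([o["Key"] for o in resp.get("Contents", [])])  # e.g. ['folder1/object1.txt']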
Correct object key would obviously be folder1/object1.txt.
As for the reason for the discrepancy I'd sign it off to a poor UI decision.

Is there any way to set up an S3 bucket to append to the existing object for each run?

We have a requirement to append to the existing S3 object when we run the Spark application every hour. I have tried this code:
df.coalesce(1).write.partitionBy("name").mode("append").option("compression", "gzip").parquet("s3n://path")
This application is creating new parquet files for every run. Hence, I am looking for a workaround to achieve this requirement.
Question is:
How can we configure the S3 bucket to append to the existing object?
It is not possible to append to objects in Amazon S3. They can be overwritten, but not appended.
There is apparently a sneaky method where a file can be multi-part copied, with the 'source' set first to the existing file and then to some additional data. However, that cannot be accomplished with the method you show.
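
For completeness, a rough boto3 sketch of that multipart-copy trick (bucket, key, and data are placeholders; the copied part generally has to be at least 5 MB, so this only works for larger objects):

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"            # placeholder
key = "data/log.txt"            # placeholder
new_data = b"extra bytes to 'append'"

# 1. Start a multipart upload for the same key
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)

# 2. Copy the existing object as part 1
part1 = s3.upload_part_copy(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"], PartNumber=1,
    CopySource={"Bucket": bucket, "Key": key},
)

# 3. Upload the new data as part 2
part2 = s3.upload_part(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"], PartNumber=2,
    Body=new_data,
)

# 4. Complete the upload; the object now ends with the new data
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": [
        {"PartNumber": 1, "ETag": part1["CopyPartResult"]["ETag"]},
        {"PartNumber": 2, "ETag": part2["ETag"]},
    ]},
)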
If you wish to add additional data to an External Table (e.g. used by EMR or Athena), then simply add an additional file in the correct folder for the desired partition.

AWS CloudFront Behavior

I've been setting up AWS Lambda functions for S3 events. I want to set up a new structure for my bucket, but that's not possible, so I set up a new bucket the way I want and will migrate old things and send new things there. I wanted to have some of the structure the same under a given base folder name: old-bucket/images and new-bucket/images. I set up CloudFront to serve from old-bucket/images now, but I wanted to add new-bucket/images as well. I thought the behavior tab would set it up such that it would check new-bucket/images first and then old-bucket/images. Alas, that didn't work. If the object wasn't found in the first, that was the end of the line.
Am I misunderstanding how behaviors work? Has anyone attempted anything like this?
That is expected behavior. An origin tells Amazon CloudFront where to obtain the data it serves to users, and a behavior maps a path pattern (prefix, suffix, etc.) to a single origin.
For example, you could serve paths matching old-bucket/* from one Amazon S3 bucket while serving paths matching new-bucket/* from a different bucket.
However, there is no capability to 'fall-back' to a different origin if a file is not found.
You could check for the existence of files before serving the link, and then provide a different link depending upon where the files are stored. Otherwise, you'll need to put all of your files in the location that matches the link you are serving.
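
A minimal sketch of that existence check with boto3 (bucket and key names are placeholders):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def object_exists(bucket, key):
    # Return True if the key exists in the bucket
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise

# Prefer the new bucket, fall back to the old one when building the link
key = "images/example.jpeg"
bucket = "new-bucket" if object_exists("new-bucket", key) else "old-bucket"
print(f"serve {key} from {bucket}")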