Is there a way to copy all objects inside an S3 bucket to Redshift using a wildcard? - amazon-web-services

I have an S3 bucket called Facebook.
The structure is like this:
Facebook/AUS/transformedfiles/YYYYMMDDHH/payments.csv
Facebook/IND/transformedfiles/YYYYMMDDHH/payments.csv
Facebook/SEA/transformedfiles/YYYYMMDDHH/payments.csv
Is there a way to copy all payments.csv to AWS Redshift?
Something like:
copy payments Facebook/*/transformedfiles/YYYYMMDDHH/payments.csv

No, because the COPY command's FROM clause accepts an object prefix and implies a trailing wildcard; it does not support a wildcard in the middle of the path.
If you want to load specific files, you'll need to use a manifest file. You would build this manifest by calling ListObjects and programmatically selecting the files you want.
A manifest file is also necessary if you're creating the files and loading them immediately after upload, because S3 is eventually consistent -- if you rely on prefix-based selection, recently uploaded files might be missed.
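As a rough sketch of that approach, using boto3 with hypothetical bucket, partition and role names: enumerate the region prefixes explicitly (the part you wanted to wildcard), collect the payments.csv keys, and write them into a manifest that COPY can consume with the MANIFEST option.

import json

import boto3

s3 = boto3.client("s3")

bucket = "my-facebook-bucket"      # hypothetical bucket holding the Facebook/ prefix
regions = ["AUS", "IND", "SEA"]    # the part you wanted to wildcard, enumerated explicitly
hour = "2023010112"                # the YYYYMMDDHH partition to load

entries = []
for region in regions:
    prefix = f"Facebook/{region}/transformedfiles/{hour}/"
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("payments.csv"):
                entries.append({"url": f"s3://{bucket}/{obj['Key']}", "mandatory": True})

# Upload the manifest, then point COPY at it, e.g.:
#   COPY payments
#   FROM 's3://my-facebook-bucket/manifests/payments.manifest'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'   -- hypothetical role
#   CSV
#   MANIFEST;
s3.put_object(
    Bucket=bucket,
    Key="manifests/payments.manifest",
    Body=json.dumps({"entries": entries}).encode("utf-8"),
)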

Related

Copy data from one folder to another inside an AWS bucket automatically

I want to copy files from one folder to another inside the same bucket.
I have two folders, Actual and Backup.
As soon as a new file arrives in the Actual folder, I want it to be copied immediately to the Backup folder.
What you need are S3 Event Notifications. With these you can trigger a Lambda function when a new object is put; if it was put under one prefix, write the same object under the other prefix, as sketched below.
It is also worth noting that, even though it looks that way, S3 doesn't really have directories, just objects. So you are simply copying the object /Actual/some-file to a new object with the key /Backup/some-file. It only looks like there is a directory because /Actual/some-file and /Actual/other-file share the prefix /Actual/.
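A minimal sketch of such a Lambda handler in Python, assuming the event notification is configured for object-created events under the Actual/ prefix and the function's role can read and write the bucket (the prefix names below are just the ones from the question):

import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an S3 ObjectCreated event notification on the Actual/ prefix
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Keys in event notifications are URL-encoded (spaces arrive as '+')
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        if not key.startswith("Actual/"):
            continue  # defensive check; the prefix filter should already guarantee this

        backup_key = "Backup/" + key[len("Actual/"):]
        s3.copy_object(
            Bucket=bucket,
            Key=backup_key,
            CopySource={"Bucket": bucket, "Key": key},
        )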

Replace content in all files inside an S3 bucket

I have an S3 bucket which is mapped to a domain, say xyz.com. Whenever a user registers on xyz.com, a file is created and stored in the S3 bucket. Now I have thousands of files in S3 and I want to replace some text in those files. All files have a common name at the start, e.g. abc-{rand}.txt
The safest way of doing this would be to regenerate the files through the same process you originally used.
Personally I would try to avoid find-and-replace, as it could modify parts that you did not intend.
Run multiple generations in parallel and overwrite the existing files. This will ensure the files you generate match your expectations and will not need to be modified again.
As a suggestion, enable versioning before any of these operations if you want the ability to roll back quickly should anything need to be reverted.
Sadly, you can't do this in place in S3. You have to download them, change their content and re-upload.
This is because S3 is an object storage system, not a regular file system.
To simplify working with S3 files, you can use the third-party tool s3fs-fuse, which makes S3 appear like a filesystem on your OS.
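If you do go the download/modify/re-upload route, a boto3 sketch could look like this (the bucket name and the strings being replaced are hypothetical; test on a copy, or with versioning enabled, first):

import boto3

s3 = boto3.client("s3")

bucket = "xyz-com-user-files"   # hypothetical bucket name
old, new = "old text", "new text"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="abc-"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        updated = body.replace(old, new)
        if updated != body:
            # Re-uploading overwrites the object (or adds a new version if versioning is on)
            s3.put_object(Bucket=bucket, Key=key, Body=updated.encode("utf-8"))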

AWS S3 Listing API - How to list everything inside an S3 bucket with a specific prefix

I am trying to list all items with a specific prefix in an S3 bucket. Here is the directory structure that I have:
Item1/
    Item2/
        Item3/
            Item4/
                image_1.jpg
            Item5/
                image_1.jpg
                image_2.jpg
When I set the prefix to Item1/Item2, I get the following keys as a result:
Item1/Item2/
Item1/Item2/Item3/Item4/image_1.jpg
Item1/Item2/Item3/Item5/image_1.jpg
Item1/Item2/Item3/Item5/image_2.jpg
What I would like to get is:
Item1/Item2/
Item1/Item2/Item3
Item1/Item2/Item3/Item4
Item1/Item2/Item3/Item5
Item1/Item2/Item3/Item4/image_1.jpg
Item1/Item2/Item3/Item5/image_1.jpg
Item1/Item2/Item3/Item5/image_2.jpg
Is there any way to achieve this in golang?
Folders do not actually exist in Amazon S3. It is a flat object storage system.
For example, using the AWS Command-Line Interface (CLI) I could copy a file to an Amazon S3 bucket:
aws s3 cp foo.txt s3://my-bucket/folder1/folder2/foo.txt
This works just fine, even though folder1 and folder2 do not exist. This is because objects are stored with a Key (filename) that includes the full path of the object. So, the above object actually has a Key (filename) of:
folder1/folder2/foo.txt
However, to make things easier for humans, the Amazon S3 management console makes it appear as though there are folders. In S3, these are called Common Prefixes rather than folders.
So, when you make an API call to list the contents of the bucket while specifying a Prefix, it simply says "List all objects whose Key starts with this string".
Your listing doesn't show any folders because they don't actually exist.
Now, just to contradict myself, it actually is possible to create a folder (e.g. by clicking Create folder in the management console). This actually creates a zero-length object with the same name as the folder. The folder will then appear in listings because it is actually the zero-length object being listed rather than a real folder.
This is probably why Item1/Item2/ appears in your listing, but Item1/Item2/Item3 does not. Somebody, at some stage, must have "created a folder" called Item1/Item2/, which actually created a zero-length object with that Key.
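The question asked about golang, but the relevant API behaviour is the same in every SDK: if you also want the intermediate "folders" back, list with a Delimiter of "/" and read the CommonPrefixes element of each response, recursing into each prefix; the Go SDK's ListObjectsV2 takes the same Prefix and Delimiter parameters. A boto3 sketch of that walk (the bucket name is hypothetical):

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"   # hypothetical bucket name

def walk(prefix):
    # Print each "folder" (common prefix) and each object key below `prefix`
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            print(cp["Prefix"])   # e.g. Item1/Item2/Item3/
            walk(cp["Prefix"])    # recurse into the "folder"
        for obj in page.get("Contents", []):
            print(obj["Key"])     # objects directly at this level

walk("Item1/Item2/")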

AWS CLI - S3 how to replace a folder atomically?

So,
Let's say I have a folder called /example in S3. This folder contains a file called a.txt.
Using the AWS CLI, how do I upload a local folder, also called example, and replace the current S3 /example atomically? The local folder contains a file called b.txt.
So, I want the behaviour to be that the new S3 /example folder only contains b.txt.
Basically, is there a way to atomically replace an entire folder in S3 with a new one via the AWS CLI?
Thank you!
No, you can't do that.
For starters, S3 is an eventually consistent platform. That means that right after you do a write, you can still get old data back from S3. Practically, this converges quickly (seconds), but there is no upper bound. (They do provide consistency guarantees for some sequences of operations, but generally speaking, it's not strongly consistent.)
Secondly, S3 does not have a concept of "folder" or "directory". The S3 namespace is flat. The only thing that objects /example/a.txt and /example/b.txt have in common is that they start with the same string, just like /foobar.txt and /foobaz.txt begin with the same string. (The user interface does cheat a bit by treating the / character differently, giving the illusion of directories.)
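So the closest you can get is a non-atomic replace: upload the new content under the prefix, then delete whatever old keys remain (aws s3 sync with the --delete flag does essentially this). Purely as an illustration of what "replacing a folder" means in a flat namespace, with a hypothetical bucket name, in boto3:

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"    # hypothetical bucket name
prefix = "example/"

# 1. Upload the new contents of the "folder" (here, the local example/b.txt).
s3.upload_file("example/b.txt", bucket, prefix + "b.txt")

# 2. Delete every pre-existing key under the prefix that is not part of the new set.
#    Between steps 1 and 2, readers can see a mix of old and new objects.
keep = {prefix + "b.txt"}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    stale = [{"Key": o["Key"]} for o in page.get("Contents", []) if o["Key"] not in keep]
    if stale:
        s3.delete_objects(Bucket=bucket, Delete={"Objects": stale})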

AWS Redshift: Load data from many buckets on S3

I am trying to load data from two different buckets on S3 into a Redshift table. In each bucket, there are directories with dates in their names, and each directory contains many files, but there is no manifest.
Example S3 structure:
# Bucket 1
s3://bucket1/20170201/part-01
s3://bucket1/20170201/part-02
s3://bucket1/20170202/part-01
s3://bucket1/20170203/part-00
s3://bucket1/20170203/part-01
# Bucket 2
s3://bucket2/20170201/part-00
s3://bucket2/20170202/part-00
s3://bucket2/20170202/part-01
s3://bucket2/20170203/part-00
Let's say that data from both buckets for 20170201 and 20170202 should be loaded. One solution would be to run the COPY command 4 times - once per bucket-date pair. But I'm curious whether it could be done within a single COPY call. I've seen that a manifest file allows specifying several different files (including files from different buckets). However:
is there an option to use a prefix instead of the full path in the manifest,
and can I somehow use the manifest in SQL by passing it as a string instead of a file location - I want to avoid creating temporary files on S3?
You can use a manifest file to specify different buckets, paths and files.
The Using a Manifest to Specify Data Files documentation shows an example:
{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}
The documentation also says:
The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix.
The intent of using a manifest file is to know exactly which files have been loaded into Amazon Redshift. This is particularly useful when loading files that become available on a regular basis. For example, if files appear every 5 minutes and a COPY command was run to load the data from a given prefix, then it is unclear which files have already been loaded. This leads to potentially double-loading files.
The remedy is to use a manifest file that clearly specifies exactly which files to load. This obviously needs some code to find the files, create the manifest file and then trigger the COPY command.
It is not possible to load content from different buckets/paths without using a manifest file.
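As a sketch of that last step, assuming the manifest has already been written to S3 (the cluster endpoint, credentials, role and table below are all hypothetical, and the connection here uses psycopg2, which is just one way to reach Redshift) -- note that COPY only accepts an S3 path for the manifest, so it cannot be passed inline as a string:

import psycopg2

# Hypothetical connection details for the Redshift cluster
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",
)

copy_sql = """
    COPY my_table
    FROM 's3://bucket1/manifests/20170201-20170202.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MANIFEST;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # loads every file listed in the manifest in a single COPY
conn.close()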