AWS Redshift: Load data from many buckets on S3

I am trying to load data from two different buckets on S3 into a Redshift table. In each bucket, there are directories with dates in their names, and each directory contains many files, but there is no manifest.
Example S3 structure:
# Bucket 1
s3://bucket1/20170201/part-01
s3://bucket1/20170201/part-02
s3://bucket1/20170202/part-01
s3://bucket1/20170203/part-00
s3://bucket1/20170203/part-01
# Bucket 2
s3://bucket2/20170201/part-00
s3://bucket2/20170202/part-00
s3://bucket2/20170202/part-01
s3://bucket2/20170203/part-00
Let's say that data from both buckets for 20170201 and 20170202 should be loaded. One solution would be to run the COPY command 4 times - once per bucket-date pair. But I'm curious whether it could be done within a single COPY call. I've seen that a manifest file allows specifying several different files (including files from different buckets). However:
is there an option to use a prefix instead of a full path in the manifest,
and can I somehow pass the manifest in SQL as a string instead of a file location? I want to avoid creating temporary files on S3.

You can use a manifest file to specify different buckets, paths and files.
The Using a Manifest to Specify Data Files documentation shows an example:
{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}
The documentation also says:
The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix.
The intent of using a manifest file is to know exactly which files have been loaded into Amazon Redshift. This is particularly useful when loading files that become available on a regular basis. For example, if files appear every 5 minutes and a COPY command was run to load the data from a given prefix, then it is unclear which files have already been loaded. This leads to potentially double-loading files.
The remedy is to use a manifest file that clearly specifies exactly which files to load. This obviously needs some code to find the files, create the manifest file and then trigger the COPY command.
It is not possible to load content from different buckets/paths without using a manifest file.
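To illustrate the workflow described above -- find the files, write a manifest, then trigger COPY -- here is a minimal sketch. It assumes boto3 is available; the staging bucket, manifest key, target table and IAM role are placeholders, not values from the question:
import json
import boto3

s3 = boto3.client("s3")

def build_manifest(prefixes):
    """List every object under the given (bucket, prefix) pairs and
    return a Redshift manifest document."""
    entries = []
    paginator = s3.get_paginator("list_objects_v2")
    for bucket, prefix in prefixes:
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                entries.append({"url": f"s3://{bucket}/{obj['Key']}", "mandatory": True})
    return {"entries": entries}

manifest = build_manifest([
    ("bucket1", "20170201/"), ("bucket1", "20170202/"),
    ("bucket2", "20170201/"), ("bucket2", "20170202/"),
])

# The manifest still has to live on S3; Redshift cannot take it inline as a string.
s3.put_object(
    Bucket="my-etl-bucket",                       # hypothetical staging bucket
    Key="manifests/load-20170201-20170202.json",
    Body=json.dumps(manifest),
)

# The COPY command (run from your SQL client) then references the manifest:
# COPY my_table
# FROM 's3://my-etl-bucket/manifests/load-20170201-20170202.json'
# IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
# MANIFEST;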

Related

Is there a way to copy all objects inside an S3 bucket to Redshift using a wildcard?

I have an S3 Bucket called Facebook
The structure is like this:
Facebook/AUS/transformedfiles/YYYYMMDDHH/payments.csv
Facebook/IND/transformedfiles/YYYYMMDDHH/payments.csv
Facebook/SEA/transformedfiles/YYYYMMDDHH/payments.csv
Is there a way to copy all payments.csv to AWS Redshift?
something like:
copy payments Facebook/*/transformedfiles/YYYYMMDDHH/payments.csv
No, because the FROM clause accepts an object prefix and implies a trailing wildcard.
If you want to load specific files, you'll need to use a manifest file. You would build this manifest by calling ListObjects and programmatically selecting the files you want.
A manifest file is also necessary if you're creating the files and immediately uploading them, because S3 listings are eventually consistent -- if you rely on selecting files by prefix, recently written files might be missed.
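As a rough sketch of that ListObjects approach (assuming boto3; the bucket name and timestamp are placeholders, and only keys ending in payments.csv under a transformedfiles/<timestamp>/ prefix are kept):
import json
import boto3

s3 = boto3.client("s3")
bucket = "my-facebook-bucket"   # hypothetical; the question's bucket is named Facebook
timestamp = "2019010100"        # the YYYYMMDDHH directory to load

entries = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="Facebook/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Keep only .../transformedfiles/<timestamp>/payments.csv, whatever the region
        if key.endswith(f"/transformedfiles/{timestamp}/payments.csv"):
            entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})

# This JSON would be uploaded to S3 and referenced by COPY ... MANIFEST
print(json.dumps({"entries": entries}, indent=2))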

Can AWS Glue Crawler handle different file types in the same folder?

I have reports delivered to S3 in the following structure:
s3://chum-bucket/YYYY/MM/DD/UsageReportYYYYMMDD.zip
s3://chum-bucket/YYYY/MM/DD/SearchReportYYYYMMDD.zip
s3://chum-bucket/YYYY/MM/DD/TimingReportYYYYMMDD.zip
The YYYY/MM/DD vary per day. The YYYYMMDD in the filename is there because the files all go into one directory on a server before they are moved to S3.
I want to have 1 or 3 crawlers that deliver 3 tables to the catalog, one for each type of report. Is this possible? I can't seem to specify
s3://chum-bucket/**/UsageReport*.zip
s3://chum-bucket/**/SearchReport*.zip
s3://chum-bucket/**/TimingReport*.zip
I can write one crawler that excludes SearchReport and TimingReport, and therefore crawls the UsageReport only. Is that the best way?
Or do I have to completely re-do the bucket / folder / file name design?
Amazon Redshift loads all files in a given path, regardless of filename.
Redshift will not take advantage of partitions (Redshift Spectrum will, but not a normal Redshift COPY statement), but it will read files from any subdirectories within the given path.
Therefore, if you want to load the data into separate tables (UsageReport, SearchReport, TimingReport), they need to be in separate paths (directories). All files within the designated directory hierarchy must be in the same format and will be loaded into the same table via the COPY command.
An alternative is to point to specific files using a manifest file, but this can get messy.
Bottom line: Move the files to separate directories.
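If reorganising the bucket is an option, a small script can fan the existing files out into per-report prefixes. A sketch, assuming boto3 and the chum-bucket layout from the question (the new per-report prefixes are hypothetical):
import boto3

s3 = boto3.client("s3")
bucket = "chum-bucket"
report_types = ["UsageReport", "SearchReport", "TimingReport"]

# Snapshot the existing keys first, then copy each one under a per-report prefix
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

for key in keys:
    filename = key.rsplit("/", 1)[-1]           # e.g. UsageReport20190101.zip
    for report in report_types:
        if filename.startswith(report) and not key.startswith(report + "/"):
            # e.g. 2019/01/01/UsageReport20190101.zip -> UsageReport/2019/01/01/...
            s3.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": key},
                Key=f"{report}/{key}",
            )
            break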

Selecting specific files for Athena

While creating a table in Athena, I am not able to create tables using specific files. Is there any way to select all the files starting with "year_2019" from a given bucket? For example:
s3://bucketname/prefix/year_2019*.csv
The documentation is very clear about this: it is not allowed.
From:
https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html
Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.
I would like to know if the community has found some workaround :)
Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.
There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.
What you do is you create a table with
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.
The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.
You can also find a full example in this Stack Overflow answer: https://stackoverflow.com/a/55069330/1109
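As a sketch of generating that symlink.txt (assuming boto3; the bucket, data prefix and symlink prefix below are placeholders based on the question):
import boto3

s3 = boto3.client("s3")
bucket = "bucketname"                     # bucket from the question
data_prefix = "prefix/"                   # where the year_2019*.csv files live
symlink_prefix = "symlinks/year_2019/"    # hypothetical prefix used as the table LOCATION

# Collect the S3 URIs of the files the table should contain, one per line
uris = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=data_prefix):
    for obj in page.get("Contents", []):
        name = obj["Key"].rsplit("/", 1)[-1]
        if name.startswith("year_2019") and name.endswith(".csv"):
            uris.append(f"s3://{bucket}/{obj['Key']}")

# The table is created with the SymlinkTextInputFormat and its LOCATION
# pointing at symlink_prefix; this file is the only object under that prefix.
s3.put_object(
    Bucket=bucket,
    Key=f"{symlink_prefix}symlink.txt",
    Body="\n".join(uris),
)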

Comparing files between two S3 buckets

I would like to compare the file contents of two S3-compatible buckets and identify files that are missing or that differ.
Should I use checksum to do it instead?
It appears that your requirement is to compare the contents of two Amazon S3 buckets and identify files that are missing or differ between the buckets.
To do this, you could use:
Object name: This, of course, will help find missing files.
Object size: A different size indicates different contents, and the size is returned with each bucket listing.
eTag: An eTag is an MD5 checksum of the object's contents (for objects uploaded in a single part). If the same file has a different eTag, then the contents are different.
Creation date: This is not actually a reliable way to identify differences, but it can be used with other metadata to determine whether you want to update a file. For example, if two files differ and the object in the destination bucket has a newer date than the object in the source bucket, you probably don't need to copy the file across. But if the source file was modified after the destination file, it's likely to be a candidate for re-copying.
Instead of implementing all of the above logic yourself, you can also use the AWS Command-Line Interface (CLI). It has an aws s3 sync command that will compare files between the source and destination, and will then copy files that are modified or missing.
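If you do want to implement the comparison yourself, here is a minimal sketch using boto3 with placeholder bucket names; it flags keys missing from the destination and keys whose size or eTag differs:
import boto3

s3 = boto3.client("s3")

def list_bucket(bucket):
    """Return {key: (size, etag)} for every object in the bucket."""
    objects = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            objects[obj["Key"]] = (obj["Size"], obj["ETag"])
    return objects

source = list_bucket("source-bucket")             # placeholder bucket names
destination = list_bucket("destination-bucket")

missing = [key for key in source if key not in destination]
# Note: eTags of multipart uploads are not plain MD5s, so a mismatch is a hint, not proof
differing = [key for key in source
             if key in destination and source[key] != destination[key]]

print("Missing from destination:", missing)
print("Different size or eTag:", differing)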

Why does AWS S3 use objects and not files & directories

Why does AWS S3 use objects and not files & directories? Is there any specific reason not to have directories/folders in S3?
You are welcome to use directories/folders in Amazon S3. However, please realise that they do not actually exist.
Amazon S3 is not a filesystem. It is an object storage service that is highly scalable, stores trillions of objects and serves millions of objects per second. To meet the demands of such scale, it has been designed as a Key-Value store. The name of the file is the Key and the contents of the file is the Object.
When a file is uploaded to a directory (eg cat.jpg is stored in the images directory), it is actually stored with a filename of images/cat.jpg. This makes it appear to be in the images directory, but the reality is that the directory does not exist -- rather, the name of the object includes the full path.
This will not impact your normal usage of Amazon S3. However, it is not possible to rename a directory, because the directory does not exist. Instead, rename the file itself, which effectively moves it to a new directory. For example:
aws s3 mv s3://my-bucket/images/cat.jpg s3://my-bucket/pictures/cat.jpg
This will cause the pictures directory to magically appear, with cat.jpg inside it. There is no need to create the directory first, because it doesn't actually exist -- the user interface simply makes it appear as though there are directories.
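This is also why "renaming a directory" really means renaming every object under that prefix. A minimal sketch, assuming boto3 and a placeholder bucket name:
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"          # placeholder
old_prefix = "images/"
new_prefix = "pictures/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=old_prefix):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        new_key = new_prefix + old_key[len(old_prefix):]
        # S3 has no rename operation: copy each object to its new key, then delete the old one
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": old_key},
            Key=new_key,
        )
        s3.delete_object(Bucket=bucket, Key=old_key)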
Bottom line: Feel free to use directories, but be aware that they do not actually exist and can't be renamed.