Redshift COPY from AWS S3 directory full of CSV files - amazon-web-services

I am trying to perform a COPY query in Redshift in order to load different .csv files stored in an AWS S3 path (let's say s3://bucket/path/csv/). The .csv files in that path contain a date in their filenames (e.g. s3://bucket/path/csv/file_20200605.csv, s3://bucket/path/csv/file_20200604.csv, ...), since the data inside each file corresponds to a specific day. My question here (since the order in which the files are loaded matters) is: will Redshift load these files in alphabetical order?

The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket.
So, with regard to your question, the files will be loaded in parallel rather than one after another in alphabetical order.
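As a rough illustration, here is a minimal Python sketch (boto3 plus the Redshift Data API) of a COPY against that prefix. The cluster name, database, user, target table and IAM role are placeholders, not details from the question.

    # Minimal sketch, assuming boto3 credentials and a cluster reachable
    # through the Redshift Data API. Cluster, database, user, table and
    # IAM role names below are placeholders.
    import boto3

    redshift_data = boto3.client("redshift-data")

    # A single COPY against the key prefix picks up every matching file and
    # loads them in parallel; no per-file ordering is guaranteed.
    copy_sql = """
        COPY my_schema.daily_data
        FROM 's3://bucket/path/csv/file_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """

    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=copy_sql,
    )

If the day-by-day order genuinely matters, one option is to issue a separate COPY per dated file instead of a single COPY over the whole prefix.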

Related

See all files in S3 bucket using Redshift Spectrum

We have S3 buckets with a nested folder structure like TeamName/Year/Month/Day/<Parquet files 1 - n>.
We are trying to create a Redshift Spectrum table (using the Glue Data Catalog) over that S3 folder and query the data in Redshift. All the tutorials I have seen so far work with files directly under the root folder, so how do we see multiple files in Redshift that sit in the bucket under nested folders?
Also, if we add more files or folders, e.g. Day2/ParquetFiles, will Spectrum be able to detect this? Is there a way to create the Spectrum table on the root folder? The schema of all the files will be the same.
It should just read any files in the given path, including subdirectories.
Yes, you can add additional files anywhere in that path and they should be included.
From Creating external tables for Redshift Spectrum - Amazon Redshift:
The external table statement defines the table columns, the format of your data files, and the location of your data in Amazon S3. Redshift Spectrum scans the files in the specified folder and any subfolders.
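As a hedged sketch of what that can look like, the DDL below creates an external table whose LOCATION is the root prefix, issued from Python through the Redshift Data API. The external schema, column names, bucket and cluster details are assumptions for illustration, and the external schema is assumed to already exist against the Glue Data Catalog.

    # Sketch only: "spectrum_schema" is assumed to have been created already
    # with CREATE EXTERNAL SCHEMA against the Glue Data Catalog; columns,
    # bucket and cluster names are placeholders.
    import boto3

    redshift_data = boto3.client("redshift-data")

    ddl = """
        CREATE EXTERNAL TABLE spectrum_schema.team_events (
            event_id   BIGINT,
            event_name VARCHAR(256),
            event_ts   TIMESTAMP
        )
        STORED AS PARQUET
        LOCATION 's3://my-bucket/TeamName/';
    """

    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=ddl,
    )

Because the LOCATION is the root prefix, files added later under new Year/Month/Day subfolders are picked up the next time the table is queried, since Spectrum lists the location at query time.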

Partition csv data in s3 bucket for querying using Athena

I have csv log data arriving every hour in a single S3 bucket, and I want to partition it to improve query performance, as well as convert it to Parquet.
Also, how can I add partitions automatically for new logs as they are added?
Note:
csv file names follow a standard date format
the files are written by an external source and cannot be changed to be written into folders; they only land in the main bucket
I want to convert the csv files to Parquet separately
It appears that your situation is:
Objects are being uploaded to an Amazon S3 bucket
You would like those objects to be placed in a path hierarchy to support Amazon Athena partitioning
You could configure an Amazon S3 event to trigger an AWS Lambda function whenever a new object is created.
The Lambda function would:
Read the filename (or the contents of the file) to determine where it should be placed in the hierarchy
Perform a CopyObject() to put the object in the correct location (S3 does not have a 'move' command)
Delete the original object with DeleteObject()
Be careful that the above operation does not result in an event that triggers the Lambda function again (e.g. do it in a different folder or bucket), otherwise an infinite loop would occur; the sketch below includes a guard for this.
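A minimal Python sketch of such a Lambda function follows, assuming the filenames carry a YYYYMMDD date such as file_20200605.csv (an assumption, since only a "standard date format" is stated) and that the partitioned copies go under a separate prefix in the same bucket.

    # Sketch of the Lambda described above: parse the date out of the
    # incoming object key and copy the object into a year/month/day prefix
    # that Athena can use as partitions. Bucket layout, the destination
    # prefix and the filename pattern are assumptions for illustration.
    import re
    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")
    DEST_PREFIX = "partitioned"

    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])

            # Skip anything already under the destination prefix so we do
            # not re-trigger ourselves in an infinite loop.
            if key.startswith(DEST_PREFIX + "/"):
                continue

            match = re.search(r"(\d{4})(\d{2})(\d{2})\.csv$", key)
            if not match:
                continue
            year, month, day = match.groups()

            new_key = (
                f"{DEST_PREFIX}/year={year}/month={month}/day={day}/"
                f"{key.rsplit('/', 1)[-1]}"
            )

            # S3 has no 'move', so copy then delete the original.
            s3.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": key},
            )
            s3.delete_object(Bucket=bucket, Key=key)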
When you wish to convert the CSV files to Parquet, see:
Converting to Columnar Formats - Amazon Athena
Using AWS Athena To Convert A CSV File To Parquet | CloudForecast Blog
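For the conversion itself, one hedged option (not necessarily the exact approach in those articles) is an Athena CTAS statement started from Python; the database, table, column and bucket names below are placeholders.

    # Sketch: rewrite the CSV table as partitioned Parquet with a
    # CREATE TABLE AS SELECT in Athena. Table, database, bucket and
    # output locations are placeholders.
    import boto3

    athena = boto3.client("athena")

    ctas = """
        CREATE TABLE logs_parquet
        WITH (
            format = 'PARQUET',
            external_location = 's3://my-bucket/logs-parquet/',
            partitioned_by = ARRAY['year', 'month', 'day']
        ) AS
        SELECT col1, col2, col3, year, month, day
        FROM logs_csv;
    """

    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )

A CTAS registers its own partitions automatically; for new data arriving later in the source table, partitions can be added with ALTER TABLE ... ADD PARTITION, MSCK REPAIR TABLE, or Athena partition projection.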

how to append multiple csv files from different folder in s3

I have many csv files under different sub-directories in S3.
I am trying to append all of the data into one csv file.
Is there any way to append all the files using S3 or other AWS services?
Thanks
If the resulting csv is not very large (less than a couple of GB), you can use AWS Lambda to go through all the subdirectories (keys) in S3 and write the result file, for example back into S3.
You can also use AWS Glue for this operation, but I have not used it myself.
In both cases you will need to write some script to join the files, as sketched below.
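A rough sketch of that script, assuming boto3, the same header row in every file, and a result small enough to hold in memory; the bucket, prefix and output key are placeholders.

    # Sketch: list every .csv under a prefix (sub-directories included),
    # concatenate them, and write one combined file back to S3.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"
    PREFIX = "csv-data/"              # root prefix containing the sub-directories
    OUTPUT_KEY = "combined/all_data.csv"

    def combine_csv_files():
        lines = []
        header_written = False

        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                if not key.endswith(".csv"):
                    continue
                body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
                rows = body.decode("utf-8").splitlines()
                if header_written:
                    rows = rows[1:]   # drop the repeated header row
                else:
                    header_written = True
                lines.extend(rows)

        s3.put_object(
            Bucket=BUCKET,
            Key=OUTPUT_KEY,
            Body="\n".join(lines).encode("utf-8"),
        )

    if __name__ == "__main__":
        combine_csv_files()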

How can I download s3 bucket data?

I'm trying to find some way to export data about an S3 bucket, such as file path, filenames, metadata tags, last modified, and file size, to something like a .csv, .xml, or .json file. Is there any way to generate this without having to manually step through and hand-generate it?
Please note that I'm not trying to download all the files; rather, I'm trying to find a way to export the data about those files that the S3 console exposes.
Yes!
From Amazon S3 Inventory - Amazon Simple Storage Service:
Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet (Parquet) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
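For reference, a hedged boto3 sketch of enabling an inventory report programmatically is below; the bucket names, account ID and configuration Id are placeholders, and the same setup can be done in the console under the bucket's Management tab.

    # Sketch: configure a daily S3 Inventory report in CSV format.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_inventory_configuration(
        Bucket="my-source-bucket",
        Id="daily-object-inventory",
        InventoryConfiguration={
            "Id": "daily-object-inventory",
            "IsEnabled": True,
            "IncludedObjectVersions": "Current",
            "Schedule": {"Frequency": "Daily"},
            "Destination": {
                "S3BucketDestination": {
                    "Bucket": "arn:aws:s3:::my-inventory-bucket",
                    "AccountId": "123456789012",
                    "Format": "CSV",
                    "Prefix": "inventory",
                }
            },
            # Extra metadata columns to include alongside bucket and key
            "OptionalFields": ["Size", "LastModifiedDate", "StorageClass", "ETag"],
        },
    )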

Upload CSV data directly to Amazon Redshift with Talend

Is it possible to upload data directly to Amazon Redshift without passing through Amazon S3 (Using Talend)?
It is possible to do this using the Talend connectors for Postgres, but the result would be very slow indeed (it could take seconds per row of data).
You really need to:
split the large csv files up, e.g. to around 10 MB each (there is no set number for this)
gzip each csv file
upload them to S3
run a Redshift COPY command
run some SQL on Redshift if required to process the new data (an upsert, for example)
It is possible using INSERT queries, but it is not at all efficient and very slow, and thus not recommended. Redshift is built for handling and managing bulk loads.
The best and fastest approach to load data into Redshift is to split the large files into smaller parts, upload them to S3 using multipart file upload, and then load the data from S3 into Redshift with the COPY command, which reads the files in parallel (see this). A sketch of that workflow follows.
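Here is a rough Python sketch of that workflow for a single chunk, using boto3 and the Redshift Data API; the file, bucket, cluster, table and IAM role names are placeholders.

    # Sketch: gzip one split CSV chunk, upload it to S3, then COPY the whole
    # staging prefix into Redshift. All names below are placeholders.
    import gzip
    import shutil

    import boto3

    s3 = boto3.client("s3")
    redshift_data = boto3.client("redshift-data")

    # 1. gzip one of the split CSV chunks
    with open("chunk_0001.csv", "rb") as src, gzip.open("chunk_0001.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

    # 2. upload to S3 (upload_file switches to multipart upload for large files)
    s3.upload_file("chunk_0001.csv.gz", "my-staging-bucket", "load/chunk_0001.csv.gz")

    # 3. a single COPY over the shared prefix loads all uploaded chunks in parallel
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql="""
            COPY my_schema.target_table
            FROM 's3://my-staging-bucket/load/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            FORMAT AS CSV
            GZIP;
        """,
    )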