See all files in S3 bucket using Redshift Spectrum

We have S3 buckets with a nested folder structure, like TeamName/Year/Month/Day/<Parquet files 1 - n>.
We are trying to create a Redshift Spectrum external table (using the Glue Data Catalog) on the S3 folder and query the data in Redshift. All the tutorials I have seen so far work with files directly under the root folder. So how do we see, in Redshift, the multiple files that sit in nested folders in the bucket?
Also, if we add more files or folders, e.g. Day2/ParquetFiles, will Spectrum be able to detect this? Is there a way to create the Spectrum table on the root folder? The schema of all files will be the same.

It should just read any files in the given path, including subdirectories.
Yes, you can add additional files anywhere in that path and they should be included.
From Creating external tables for Redshift Spectrum - Amazon Redshift:
The external table statement defines the table columns, the format of your data files, and the location of your data in Amazon S3. Redshift Spectrum scans the files in the specified folder and any subfolders.
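For example, a minimal sketch of what the external table could look like when pointed at the root prefix (the bucket name, external schema, and column definitions below are assumptions for illustration, not taken from the question):

-- assumes an external schema has already been created from the Glue Data Catalog,
-- e.g. CREATE EXTERNAL SCHEMA spectrum_schema FROM DATA CATALOG DATABASE 'my_glue_db' ...
CREATE EXTERNAL TABLE spectrum_schema.team_data (
    event_id   bigint,
    event_name varchar(256),
    event_ts   timestamp
)
STORED AS PARQUET
-- Spectrum scans this prefix and every subfolder (Year/Month/Day/, Day2/, ...)
LOCATION 's3://my-team-bucket/TeamName/';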

Related

Glue crawler creating tables from files inside folders

I'm trying to crawl an S3 bucket with multiple folders, each one containing some CSV files extracted by a Glue Job from Amazon RDS.
At the moment, this is basically the schema for S3:
s3://bucket/folder_table_x/files
s3://bucket/folder_table_y/files
The goal is to crawl these buckets and folders to create a new database and then query it via Amazon Athena.
But I'm getting the following tables (i.e. named after the file, not the folder):
run-unnamed-1-part-r-00000
Most of the tables are being created correctly, but I'm not able to deal with some.
I've already set the table level to 2 (is that right?) and also set the option that says "Create a single schema for each S3 path".
These files that are being created as tables contain only the header, but no data.
Can anyone help?

Query multiple CSV files in S3 through Athena

I exported my SQL DB into S3 in CSV format. Each table is exported into separate CSV files and saved in Amazon S3. Now, can I send a query to that S3 bucket which joins multiple tables (multiple CSV files in S3) and get a result set? How can I do that and save the result in a separate CSV file?
The steps are:
Put all files related to one table into a separate folder (directory path) in the S3 bucket. Do not mix files from multiple tables in the same folder because Amazon Athena will assume they all belong to one table.
Use the CREATE TABLE command to define a new table in Amazon Athena, and specify where the files are kept via the LOCATION 's3://bucket_name/[folder]/' parameter. This tells Athena which folder to use when reading the data (see the sketch after these steps).
Or, instead of using CREATE TABLE, an easier way is:
Go to the AWS Glue management console
Select Create crawler
Select Add a data source and provide the location in S3 where the data is stored
Provide other information as prompted (you'll figure it out)
Then, run the crawler and AWS Glue will look at the data files in the specified folder and will automatically create a table for that data. The table will appear in the Amazon Athena console.
Once you have created the tables, you can use normal SQL to query and join the tables.
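As a rough sketch of steps 1 and 2 and of the final join (the table names, columns, and bucket paths below are invented for the example; the CTAS at the end is one way to write the joined result back to S3 as delimited text):

-- one external table per folder of CSV files (columns and paths are assumptions)
CREATE EXTERNAL TABLE orders (
    order_id    int,
    customer_id int,
    amount      double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket_name/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1');

CREATE EXTERNAL TABLE customers (
    customer_id int,
    name        string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket_name/customers/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- join across the two folders as if they were ordinary tables
SELECT c.name, SUM(o.amount) AS total_spent
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.name;

-- optionally, write a joined result back to S3 as delimited text via CTAS
-- (Athena may compress the output files by default)
CREATE TABLE joined_result
WITH (
    format            = 'TEXTFILE',
    field_delimiter   = ',',
    external_location = 's3://bucket_name/results/joined/'
) AS
SELECT c.name, o.order_id, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;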

Redshift COPY from AWS S3 directory full of CSV files

I am trying to perform a COPY query in Redshift in order to load different .csv files stored in an AWS S3 path (let's say s3://bucket/path/csv/). The .csv files in that path contain a date in their filenames (i.e.: s3://bucket/path/csv/file_20200605.csv, s3://bucket/path/csv/file_20200604.csv, ...) since the data inside them corresponds to a specific day. My question here is (since the order of loading the files matters): will Redshift load these files in alphabetical order?
The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket.
So, with regard to your question, the files will be loaded in parallel, not in alphabetical order.
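For reference, such a COPY might look like the sketch below (the target table name and the placeholder IAM role are assumptions); every object whose key starts with the given prefix is loaded, and no ordering is guaranteed:

-- loads file_20200604.csv, file_20200605.csv, ... in parallel, in no particular order
COPY daily_data
FROM 's3://bucket/path/csv/file_'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'   -- placeholder
FORMAT AS CSV
IGNOREHEADER 1;

If the per-day order genuinely matters, one option is to issue a separate COPY per file (or per day's prefix) in the order you need.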

How can I download s3 bucket data?

I'm trying to find some way to export data about an S3 bucket, such as file path, filenames, metadata tags, last modified date, and file size, to something like a .csv, .xml, or .json file. Is there any way to generate this without having to manually step through and hand-generate it?
Please note I'm not trying to download all the files, rather I'm trying to get at a way to export the exposed data about those files presented in the s3 console.
Yes!
From Amazon S3 Inventory - Amazon Simple Storage Service:
Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet (Parquet) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
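Once an inventory is being delivered, one way to filter or re-export that listing is to point Athena at the inventory files. A minimal sketch, assuming a CSV-format inventory with only Size and Last modified date selected as optional fields and a made-up delivery prefix (the column mapping is positional, so the DDL has to match whatever fields were actually configured):

CREATE EXTERNAL TABLE s3_inventory (
    bucket             string,
    object_key         string,
    size               string,
    last_modified_date string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://inventory-dest-bucket/some-prefix/source-bucket/inventory-config/data/';   -- assumed delivery prefix

-- e.g. the 100 largest objects
SELECT object_key, CAST(size AS bigint) AS size_bytes, last_modified_date
FROM s3_inventory
ORDER BY CAST(size AS bigint) DESC
LIMIT 100;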

AWS Glue ETL Job fails with AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

I'm trying to create an AWS Glue ETL Job that would load data from Parquet files stored in S3 into a Redshift table.
The Parquet files were written using pandas with the 'simple' file schema option into multiple folders in an S3 bucket.
The layout looks like this:
s3://bucket/parquet_table/01/file_1.parquet
s3://bucket/parquet_table/01/file_2.parquet
s3://bucket/parquet_table/01/file_3.parquet
s3://bucket/parquet_table/02/file_1.parquet
s3://bucket/parquet_table/02/file_2.parquet
s3://bucket/parquet_table/02/file_3.parquet
I can use an AWS Glue Crawler to create a table in the AWS Glue Catalog, and that table can be queried from Athena, but it does not work when I try to create an ETL Job that would copy the same table to Redshift.
If I crawl a single file, or if I crawl multiple files in one folder, it works; as soon as there are multiple folders involved, I get the above-mentioned error:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
Similar issues appear if, instead of the 'simple' schema, I use 'hive'. Then we have multiple folders and also empty Parquet files that throw:
java.io.IOException: Could not read footer: java.lang.RuntimeException: xxx is not a Parquet file (too small)
Is there some recommendation on how to read Parquet files and structure them in S3 when using AWS Glue (ETL and Data Catalog)?
Redshift doesn't support parquet format. Redshift Spectrum does. Athena also supports parquet format.
The error that you're facing is because, when reading the Parquet files from S3, Spark/Glue expects the data to be in Hive-style partitions, i.e. the partition folder names should be key=value pairs. You'll have to lay out the S3 hierarchy in Hive-style partitions, something like below:
s3://your-bucket/parquet_table/id=1/file1.parquet
s3://your-bucket/parquet_table/id=2/file2.parquet
and so on.
Then use the below path to read all the files in the bucket:
location : s3://your-bucket/parquet_table
If the data in S3 is partitioned the above way, you won't face any issues.
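For completeness, a table over that layout would then declare the key as a partition column. A minimal Athena/Glue-catalog sketch (the data columns and partition key name are assumptions); with Redshift Spectrum you would add the partitions with ALTER TABLE ... ADD PARTITION instead of MSCK REPAIR:

CREATE EXTERNAL TABLE parquet_table (
    col_a string,
    col_b bigint
)
PARTITIONED BY (id int)
STORED AS PARQUET
LOCATION 's3://your-bucket/parquet_table/';

-- registers the id=1/, id=2/, ... folders as partitions
MSCK REPAIR TABLE parquet_table;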