How to skip files with a specific extension on Redshift external tables? - amazon-web-services

I have a partitioned location on S3 with data I want to read via a Redshift external table, which I create with the SQL statement CREATE EXTERNAL TABLE....
The only issue is that these partitions also contain some metadata files with, for example, a .txt extension, while the data I'm reading is .json.
Is it possible to tell Redshift to skip those files, in a manner similar to Glue Crawler exclude patterns?
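For context, a minimal sketch of the kind of table in question, assuming a JSON SerDe (the schema, column, and location names here are illustrative, not taken from the question):
-- illustrative names; partitioned JSON data under s3://my-bucket/events/
create external table spectrum_schema.events (
  id varchar(64),
  payload varchar(65535)
)
partitioned by (dt varchar(10))
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location 's3://my-bucket/events/';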

Can you try using the pseudocolumns in the SQL and excluding files based on the path name? The "$path" pseudocolumn gives the S3 path of the file each row came from, so the filter is applied when querying the external table rather than in its definition:
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE_usage.html
select *
from spectrum_schema.events
where "$path" like '%.json';

Related

How to create External Table without specifying columns in Redshift?

I have a folder containing files in Parquet format. I used a crawler to create a table in the Glue Data Catalog, which came to 2500+ columns. I want to create an external table on top of it in Redshift.
But all the articles I have read specify the columns explicitly.
Is there any way for the table to read its schema directly from the table in the Data Catalog, so that I don't have to feed it in separately?
You can create an external schema in Redshift which is based on a data catalog. This way, you will see all tables in the data catalog without creating them in Redshift.
create external schema spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;
In the example above, taken from the documentation, spectrum_db is the name of the database in your Data Catalog.
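Once the schema exists, any table the crawler created in that Data Catalog database is immediately queryable through it, columns included; for example (table name illustrative):
select * from spectrum_schema.my_crawled_table limit 10;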

How to create external table using s3 location from Glue catalog?

I have a Glue job that writes CSV data to an S3 bucket. I would like to create an external table or view based on this data, but I do not want to hardcode the S3 location in the table definition; I want to use the S3 location stored in the Glue catalog. Something like this:
CREATE VIEW t1 AS (
    SELECT * FROM some_glue_catalog_table
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
XY problem explanation:
What I actually need is to split my CSV data into columns. My CSV data is properly catalogued, but when I run a SELECT on it with Athena, all the data is shoved into the first column and the remaining columns are empty. I've looked for an option that lets me specify things like the delimiter and quote character, but those options only seem to be available in CREATE EXTERNAL TABLE statements, which is why I'm now asking the question above.
I do not wish to use the Athena API/AWS CLI for this.
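For reference, the delimiter and quote-character options mentioned above are set through a SerDe in the CREATE EXTERNAL TABLE DDL; a minimal sketch (table, column, and bucket names are illustrative):
-- illustrative names; OpenCSVSerde parses delimited, quoted CSV
CREATE EXTERNAL TABLE my_csv_table (
    col_a string,
    col_b string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '"'
)
LOCATION 's3://my-bucket/csv-output/';
Note that OpenCSVSerde reads every column as string by default, so cast in your queries where needed.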

Update BigQuery permanent external tables

I'm using BigQuery both to store data within "native" BigQuery tables and to query data stored in Google Cloud Storage. According to the documentation, it is possible to query external sources using two types of tables: permanent and temporary external tables.
Consider the following scenario: every day some Parquet files are written to GCS, and at a certain frequency I want to do a JOIN between the data stored in a BigQuery table and the data stored in the Parquet files. If I create a permanent external table and then update the underlying files, is the content of the table automatically updated as well, or do I have to recreate it from the new files?
What are the best practices for such a scenario?
You don't have to re-create the external table when you add new files to the Cloud Storage bucket. The only exception: if a new file has a different number of columns, the external table will not work as expected.
You need to use a wildcard to read all files matching a specific pattern, rather than providing a static file name, for example "gs://bucketName/*.csv".
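A minimal sketch of such a permanent external table defined over a wildcard URI with BigQuery DDL (dataset, table, and bucket names are illustrative):
-- illustrative names; the schema is inferred from the Parquet files
CREATE EXTERNAL TABLE my_dataset.daily_events
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucketName/events/*.parquet']
);
Because the table definition is only a format plus a file pattern, the matching objects are listed when you query, which is why newly added files appear without recreating the table.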

How to use multiple file formats in Athena

I have multiple files in different formats (CSV, JSON, and Parquet) in an S3 bucket directory (all files are in the same directory). All files have the same structure. How can I use these files to create an Athena table?
Is there a provision to specify a different SerDe per file while creating the table?
Edit: the table gets created, but there is no data when I preview it.
There are a few options, but in my opinion it is best to create separate paths (folders) for each type of file and run a Glue Crawler on each of them. You will end up with multiple tables, but you can consolidate them using Athena views (see the sketch below) or convert the files to one format using Glue, for instance.
If you want to keep the files in one folder, you can use include and exclude patterns in the Glue Crawler. In this case, too, you will have to create a separate table for each type of file.
https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
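A sketch of the view-based consolidation, assuming the crawler produced one table per format (table names are illustrative):
-- illustrative names; works because all files share the same structure
CREATE VIEW events_all AS
SELECT * FROM events_csv
UNION ALL
SELECT * FROM events_json
UNION ALL
SELECT * FROM events_parquet;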

Run Hive on S3?

I want to run SQL queries on S3 files/buckets through Hive. I have no idea how to do the setup. I'd appreciate your help.
You first create an EXTERNAL TABLE that defines the data format and points to a location in Amazon S3:
CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';
You can then read from the table using normal SELECT commands, for example:
SELECT b_col FROM s3_export;
Alternatively, you can use Amazon Athena to run Hive-like queries against data in Amazon S3 without even requiring a Hadoop cluster. (It is actually based on Presto, whose SQL syntax is very similar to Hive's.)