Redshift - What is the CREATE EXTERNAL TABLE command for?

When I run CREATE EXTERNAL SCHEMA, I automatically see the tables inside that schema.
Recently, I've been watching some tutorial videos where CREATE EXTERNAL TABLE is run after CREATE EXTERNAL SCHEMA. What's the point of CREATE EXTERNAL TABLE if CREATE EXTERNAL SCHEMA already links to those tables?

An external schema can be of various types: you can create one from the Glue Data Catalog, from another Redshift local database, or from a remote Postgres or MySQL database, etc.
Some of these are read-only in terms of both data and metadata, i.e. you cannot insert a new record into any of the tables, and you cannot create another table in the schema. The second and third types mentioned above fall into this category.
A schema created from the Glue Data Catalog, on the other hand, is read-only only in terms of data: you can still add new tables to it. To create tables on top of files in such a schema, you need the CREATE EXTERNAL TABLE statement.
You can choose to create these tables through Redshift, or you can create them through Athena, Glue Crawlers, etc.
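For illustration, here is a minimal sketch of the Glue Data Catalog case (the schema, database, table, and S3 path names are made-up placeholders):
create external schema spectrum
from data catalog
database 'my_glue_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole';
-- Registers a new table in the Glue database, on top of files already in S3:
create external table spectrum.sales (
    sale_id   int,
    sale_date date,
    amount    decimal(10,2)
)
stored as parquet
location 's3://my-bucket/sales/';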

Related

How to skip files with specific extension on Redshift external tables?

I have a partitioned location on S3 with data I want to read via Redshift External Table, which I create with the SQL statement CREATE EXTERNAL TABLE....
The only thing is that I have some metadata files within these partitions with, for example, the extension .txt, while the data I'm reading is .json.
Is it possible to inform Redshift to skip those files, in a similar manner to Glue Crawler exclude patterns?
e.g. Glue crawler exclude patterns
Can you try using the pseudocolumns in your SQL and excluding files based on the path name?
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE_usage.html
create external table ...;
select ...
from <external_schema>.<external_table>
where "$path" like '%.json';

How to create External Table without specifying columns in Redshift?

I have a folder containing files in parquet format. I used a crawler to create a table in the Glue Data Catalog, which came to 2500+ columns. I want to create an external table on top of it in Redshift.
But all the articles that I have read mention the columns explicitly.
Is there any way for the table to read the schema directly from the table in the data catalog, so that I don't have to feed it in separately?
You can create an external schema in Redshift which is based on a data catalog. This way, you will see all tables in the data catalog without creating them in Redshift.
create external schema spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;
In the above example from the documentation, spectrum_db is the name of the database in your data catalog (i.e. the Glue database that holds your crawler-created table).
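Once the schema exists, any table the crawler created in that database can be queried immediately (the table name here is hypothetical):
select *
from spectrum_schema.my_crawled_table
limit 10;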

All redshift external schemas showing all the tables. How to limit visibility of external tables in an external schema?

I have an AWS Redshift cluster (say cluster A) and a database (say db A) in it. I have created an external schema (say sch A) and created several external tables in it, which have their data in S3.
Now, I want to create another external schema (say sch B) where I want to create some other external tables. My intention is to create these schemas as an admin user and give separate users access to separate schemas using GRANT USAGE ON.
However, after creating sch B, I observe that all the tables I created in sch A are visible through it and can be queried.
Any idea why this is happening and how to prevent it? Please suggest.
Note - I created both external schemas with the same IAM role, which has a policy to read the S3 bucket where the data resides. I don't think this is the issue.
You are running into a point of confusion lots of people hit.
An external schema is badly misnamed - it is not really like a schema at all.
An external schema is, and is only, a pointer to an external database. The external table metadata is stored in that external database.
You can have any number of external schemas pointing to the same external database. When you create a table using any of those external schemas, the table metadata is written to the external database the schema points at - which means, of course, the table is visible through all the other external schemas, because they all point at the same external database.
If you want the contents of each schema to be visible only in that one schema, you need a separate external database for each schema. Note that having too many external databases seems (I've yet to investigate properly, but it looks like it) to make the system tables that carry external table information extremely slow - tens of minutes to return a query. I never use GUI tools for SQL, but I can imagine they query those system tables, so this could be a problem for them.
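A sketch of that fix, assuming Glue-backed schemas (the database, schema, and role names are placeholders):
-- Each schema points at its own external database, so tables
-- created through one schema are not visible through the other.
create external schema sch_a
from data catalog
database 'ext_db_a'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;
create external schema sch_b
from data catalog
database 'ext_db_b'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;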

Update BigQuery permanent external tables

I'm using BigQuery both to store data within "native" BigQuery tables and to query data stored in Google Cloud Storage. According to the documentation, it is possible to query external sources using two types of tables: permanent and temporary external tables.
Consider the following scenario: every day some parquet files are written to GCS, and with a certain frequency I want to do a JOIN between the data stored in a BigQuery table and the data stored in the parquet files. If I create a permanent external table and then update the underlying files, is the content of the table automatically updated as well, or do I have to recreate it from the new files?
What are the best practices for such a scenario?
You don't have to re-create the external table when you add new files to the Cloud Storage bucket. The only exception is if the number of columns differs in a new file; then the external table will not work as expected.
You need to use a wildcard to read files that match a specific pattern, rather than providing a static file name, for example "gs://bucketName/*.csv".
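A minimal sketch of such a permanent external table definition in BigQuery DDL (the dataset, table, and bucket names are hypothetical):
-- New parquet files matching the wildcard are picked up on every query;
-- the table definition itself never needs to be recreated.
create external table my_dataset.my_external_table
options (
  format = 'PARQUET',
  uris = ['gs://my-bucket/daily-exports/*.parquet']
);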

Is there any way we can run multiple SQL queries at the same time in Athena Database

I have to create 20 tables in an Athena database at the same time. Can I do it with a single execution?
Example:
CREATE EXTERNAL TABLE IF NOT EXISTS database_1.A
;
CREATE EXTERNAL TABLE IF NOT EXISTS database_1.B
;
CREATE EXTERNAL TABLE IF NOT EXISTS database_1.C
I have used the AWS CLI for such problems.
Create a list of SQL statements:
sql_list.txt
CREATE EXTERNAL TABLE IF NOT EXISTS database_1.A;
CREATE EXTERNAL TABLE IF NOT EXISTS database_1.B;
CREATE EXTERNAL TABLE IF NOT EXISTS database_1.C;
----------
exec_sqls.sh
input_file=$1
while IFS= read -r sql
do
    echo "$sql"
    aws athena start-query-execution --query-string "$sql" --result-configuration OutputLocation=s3://<bucket>
done < "$input_file"
-----------
sh -x exec_sqls.sh sql_list.txt
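Note that start-query-execution is asynchronous: it returns a QueryExecutionId immediately rather than waiting for the query to finish. If you need to confirm each table was actually created, you can poll for the status of each submitted query:
aws athena get-query-execution --query-execution-id <query-execution-id>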
You can submit multiple requests simultaneously to Amazon Athena (e.g. via different threads in your application), but each Amazon Athena command can only execute a single SQL query/command.
I have a similar solution, but using Redshift external tables and DBeaver.
Changes made to external tables will be reflected automatically in Athena.
By using DBeaver I'm able to run several DDL statements in a single execution.
Minor changes are required (see the sketch below):
Update the data types for each column from Athena to Redshift
Update the database name from Athena to Redshift's schema name
Adjust the syntax for creating partitioned tables, if applicable
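As an illustration of those changes (the table, column, and schema names are hypothetical), an Athena definition like:
CREATE EXTERNAL TABLE my_db.events (
    event_id  string,
    payload   string
)
PARTITIONED BY (event_date string)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';
would become, in Redshift Spectrum (string becomes varchar, and the Athena database becomes the external schema name):
create external table my_spectrum_schema.events (
    event_id  varchar(64),
    payload   varchar(65535)
)
partitioned by (event_date varchar(10))
stored as parquet
location 's3://my-bucket/events/';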