I have a folder containing files in Parquet format. I used a crawler to create a table defined in the Glue Data Catalog, which ended up with 2500+ columns. I want to create an external table on top of it in Redshift.
But all the articles I have read list the columns explicitly.
Is there any way for the table to read the schema directly from the table in the Data Catalog, so that I don't have to supply it separately?
You can create an external schema in Redshift that is based on a data catalog. This way you will see all the tables in the Data Catalog without having to create them in Redshift.
create external schema spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;
In the above example from the documentation, spectrum_db is the name of the database in your Data Catalog.
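Once the external schema exists, every table the crawler registered in that Data Catalog database shows up under it, so the 2500+ columns come straight from the catalog and never have to be listed by hand. A quick check, assuming the crawler named the table my_parquet_table (a hypothetical name):
select *
from spectrum_schema.my_parquet_table
limit 10;
select columnname, external_type
from svv_external_columns
where schemaname = 'spectrum_schema'
and tablename = 'my_parquet_table';
The second query reads from SVV_EXTERNAL_COLUMNS, which shows the column definitions Redshift picked up from the Glue Data Catalog.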
I have a partitioned location on S3 with data that I want to read via a Redshift external table, which I create with the SQL statement CREATE EXTERNAL TABLE....
The only issue is that there are some metadata files within these partitions, for example with the extension .txt, while the data I'm reading is .json.
Is it possible to tell Redshift to skip those files, in a similar manner to Glue crawler exclude patterns?
Can you try using the pseudocolumns in the SQL and excluding based on the path name?
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE_usage.html
create external table ... ;

select ...
from ...
where "$path" like '%.json';
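For a fuller picture, here is a minimal sketch with hypothetical table, column, and bucket names: the external table covers the whole partitioned prefix (the .json data files as well as the .txt metadata files), and the query keeps only rows that came from .json files by filtering on the "$path" pseudocolumn:
create external table spectrum_schema.events (
    event_id varchar(64),
    payload  varchar(65535)
)
partitioned by (dt date)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as textfile
location 's3://my-bucket/events/';

-- each partition still has to be registered, e.g.:
-- alter table spectrum_schema.events add partition (dt = '2023-01-01')
-- location 's3://my-bucket/events/dt=2023-01-01/';

select event_id, "$path"
from spectrum_schema.events
where "$path" like '%.json';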
When I run CREATE EXTERNAL SCHEMA, I automatically see the tables inside that schema.
Recently, I've been watching some tutorial videos where CREATE EXTERNAL TABLE is run after CREATE EXTERNAL SCHEMA. What's the point of CREATE EXTERNAL TABLE if CREATE EXTERNAL SCHEMA already links to those tables?
External schemas can be of various types. One can create an external schema from the Glue Data Catalog, from another Redshift database, or from a remote Postgres or MySQL database, etc.
Some of these are read-only, both in terms of data and metadata: one cannot insert a new record into any of the tables, and one cannot create another table in the schema. The second and third types mentioned above fall into this category.
On the other hand, a schema created from the Glue Data Catalog is read-only in terms of data, but one can add new tables to it. To create tables on top of files in this schema, we need the CREATE EXTERNAL TABLE statement.
You can either choose to create these tables through Redshift, or you can create them through Athena, Glue crawlers, etc.
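For example, assuming a Glue-backed schema named spectrum_schema and a hypothetical S3 prefix, a table created from Redshift like this lands in the Glue Data Catalog and is immediately visible to Athena and Glue as well:
create external table spectrum_schema.sales_events (
    event_id varchar(64),
    event_ts timestamp,
    amount   decimal(12,2)
)
stored as parquet
location 's3://my-bucket/sales_events/';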
I have a Parquet file in S3 that I want to load into a Redshift table. I don't know the schema of the Parquet file.
Is there any command to create a table and then copy the Parquet data into it?
Also, I want to add a default time column, something like date timestamp DEFAULT to_char(CURRDATE, 'YYYY-MM-DD').
First you need to create an external schema; normal columnar schemas do not support Parquet files.
Then you need to create an external table matching the columns in the S3 file, so look for a good mapping between the file's types and Redshift data types. I usually avoid small ints, for example, and for floats I create the column as the REAL data type.
Then run the COPY command below:
COPY schema_name.table_name
FROM 's3://bucket_name/object_prefix/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>'
FORMAT AS PARQUET;
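Putting that together, here is a minimal sketch of the COPY step into a regular Redshift table, with hypothetical names and a guessed two-column schema; in practice you would inspect the Parquet file first (for example with a Glue crawler, Athena, or a parquet tool) to get the real column names and types:
create table public.sales_stage (
    order_id bigint,
    amount   real
);

copy public.sales_stage
from 's3://my-bucket/parquet/sales/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
format as parquet;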
I want to create a stored procedure that can read data from a GCS bucket and store it in a table in BigQuery. I was able to do it in Python by connecting to GCS and creating a BigQuery client.
from google.oauth2 import service_account
from google.cloud import bigquery
credentials = service_account.Credentials.from_service_account_file(path_to_key)
bq_client = bigquery.Client(credentials=credentials, project=project_id)
Can we achieve the same using stored procedure?
Have you looked into using an external table? You can query directly off of Google Cloud Storage without needing to load anything. Just define the schema of the expected data along with the GCS URIs, and once the data is in GCS it is accessible via SQL in BigQuery.
Otherwise, no, there is no LOAD statement that you can execute via BigQuery SQL. See the docs here for all of the ways to load data into a table.
If you are really hell-bent on having a stored procedure to "load" data into a normal BigQuery table, you could create the external table and then a stored procedure that does an INSERT into another table using the data from that external table. Otherwise, external tables are an excellent option that obviates the need to even load the data in the first place.
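A minimal sketch of that pattern, with hypothetical dataset, table, and bucket names, assuming CSV files in GCS and an existing native table mydataset.events: the external table exposes the GCS files to SQL, and the procedure copies their contents into the native table each time it is called:
create or replace external table mydataset.ext_events (
    event_id string,
    event_ts timestamp
)
options (
    format = 'CSV',
    uris = ['gs://my-bucket/events/*.csv']
);

create or replace procedure mydataset.load_events()
begin
    insert into mydataset.events (event_id, event_ts)
    select event_id, event_ts
    from mydataset.ext_events;
end;

call mydataset.load_events();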
I'm using BigQuery both to store data within "native" BigQuery tables and to query data stored in Google Cloud Storage. According to the documentation, it is possible to query external sources using two types of tables: permanent and temporary external tables.
Consider the following scenario: every day some Parquet files are written to GCS, and with a certain frequency I want to do a JOIN between the data stored in a BigQuery table and the data stored in the Parquet files. If I create a permanent external table and then update the underlying files, is the content of the table automatically updated as well, or do I have to recreate it from the new files?
What are the best practices for such a scenario?
You don't have to re-create the external table when you add new files to the Cloud Storage bucket. The only exception is if the number of columns in a new file is different; then the external table will not work as expected.
You need to use a wildcard to read all files that match a specific pattern rather than providing a static file name, for example "gs://bucketName/*.csv".
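For Parquet specifically, no column list is needed because the schema is read from the files themselves. A minimal sketch with hypothetical names: the wildcard picks up every new file dropped under the prefix, and the join runs against whatever is in the bucket at query time:
create or replace external table mydataset.ext_daily
options (
    format = 'PARQUET',
    uris = ['gs://my-bucket/daily/*.parquet']
);

select n.id, n.name, e.amount
from mydataset.native_table n
join mydataset.ext_daily e
    on n.id = e.id;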