I'm using BigQuery both to store data within "native" BigQuery tables and to query data stored in Google Cloud Storage. According to the documentation, it is possible to query external sources using two types of tables: permanent and temporary external tables.
Consider the following scenario: every day some Parquet files are written to GCS, and at a certain frequency I want to do a JOIN between the data stored in a BigQuery table and the data stored in those Parquet files. If I create a permanent external table and then the underlying files are updated, is the content of the table automatically updated as well, or do I have to recreate it from the new files?
What are the best practices for such a scenario?
You don't have to re-create the external table when you add new files to the Cloud Storage bucket. The only exception is when a new file has a different schema (for example, a different number of columns); in that case the external table will not work as expected.
You need to use a wildcard to read all files that match a specific pattern rather than providing a static file name. Example: "gs://bucketName/*.csv"
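As a rough sketch (project, dataset, table, and bucket names here are hypothetical), a permanent external table over the Parquet files can be defined once with a wildcard URI and then joined against a native table; each query reads whatever files currently match the wildcard:

-- Hypothetical names; with Parquet the schema is detected from the files.
CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.daily_events_ext`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/daily-exports/*.parquet']
);

-- Each run of this query reads the files that currently match the wildcard.
SELECT c.customer_id, c.customer_name, e.event_type, e.event_ts
FROM `my_project.my_dataset.customers` AS c
JOIN `my_project.my_dataset.daily_events_ext` AS e
  ON e.customer_id = c.customer_id;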
I have a partitioned location on S3 with data I want to read via a Redshift external table, which I create with the SQL statement CREATE EXTERNAL TABLE....
The only issue is that these partitions also contain some metadata files with, for example, a .txt extension, while the data I'm reading is .json.
Is it possible to tell Redshift to skip those files, in a similar manner to Glue crawler exclude patterns?
Can you try using the $path pseudocolumn in the SQL and excluding files based on the path name?
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE_usage.html
CREATE EXTERNAL TABLE AS ....
SELECT .....
WHERE "$path" LIKE '%.json'
I am uploading CSV files to an S3 bucket, creating tables through a Glue crawler, viewing the tables in Athena, connecting Athena to QuickSight, and showing the results graphically in QuickSight.
But what I need to do now is keep a history of the files uploaded. Instead of a new CSV file being uploaded and the crawler updating the table, can I have the crawler save each record separately? Is that even a reasonable thing to do, since I wonder whether it would create so many tables that it becomes a mess?
I'm just trying to figure out a way to keep a history of previous records. How can I achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.
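A sketch of what that looks like in practice (database, table, column, and bucket names are hypothetical): the table's LOCATION points at a prefix, and every CSV file uploaded under that prefix is included in query results automatically, so keeping history is just a matter of not deleting old files.

-- Hypothetical names; any CSV file added under the LOCATION prefix
-- is automatically included the next time the table is queried.
CREATE EXTERNAL TABLE my_db.sales_history (
  order_id    string,
  order_date  date,
  order_total double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/sales-history/'
TBLPROPERTIES ('skip.header.line.count' = '1');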
When I run CREATE EXTERNAL SCHEMA, I automatically see the tables inside that schema.
Recently, I've been watching some tutorial videos where CREATE EXTERNAL TABLE is run after CREATE EXTERNAL SCHEMA. What's the point of CREATE EXTERNAL TABLE if CREATE EXTERNAL SCHEMA already links to those tables?
An external schema can be of various types. One can create external schemas from the Glue Data Catalog, from another Redshift local database, or from a remote Postgres or MySQL database, etc.
Some of these are read-only, both in terms of data and metadata; that is, one cannot insert a new record into any of the tables, and one cannot create another table in the schema. The second and third types mentioned above fall into this category.
On the other hand, a schema created from the Glue Data Catalog is read-only in terms of data, but one can add new tables to it. To create tables on top of files in this schema, we need the CREATE EXTERNAL TABLE statement.
You can create these tables either through Redshift or through Athena, Glue crawlers, etc.
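A minimal sketch of the two statements together (schema, Glue database, IAM role, and bucket names are hypothetical): the external schema is created once from the Glue Data Catalog, and CREATE EXTERNAL TABLE then registers a table on top of files in S3 inside that schema.

-- Hypothetical names; creates the schema from the Glue Data Catalog,
-- then registers a table over Parquet files in S3 inside that schema.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

CREATE EXTERNAL TABLE spectrum_schema.clickstream (
  user_id    varchar(64),
  event_name varchar(128),
  event_ts   timestamp
)
STORED AS PARQUET
LOCATION 's3://my-bucket/clickstream/';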
I want to create a stored procedure that can read data from a GCS bucket and store it in a table in BigQuery. I was able to do it using Python by connecting to GCS and creating a BigQuery client.
from google.oauth2 import service_account
from google.cloud import bigquery
credentials = service_account.Credentials.from_service_account_file(path_to_key)
bq_client = bigquery.Client(credentials=credentials, project=project_id)
Can we achieve the same thing using a stored procedure?
Have you looked into using an external table? You can query directly off of Google Cloud Storage without needing to load anything. Just define the schema of the expected data along with the GCS URIs, and once the data is in GCS it is accessible via SQL in BigQuery.
Otherwise, no: there is no LOAD statement that you can execute via BigQuery SQL. See the docs here for all of the ways to load data into a table.
You could create the external table and then a stored procedure that does an INSERT into another table using the data from the external table. That is, if you are really hell-bent on having a stored procedure to "load" data into a normal BigQuery table. Otherwise, external tables are an excellent option that obviates the need to load the data in the first place.
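For instance (all names hypothetical, and assuming my_dataset.gcs_staging_ext is an external table already defined over the GCS files as described above), such a procedure could look like this:

-- Hypothetical names; copies whatever the external table currently sees
-- in GCS into a native BigQuery table.
CREATE OR REPLACE PROCEDURE my_dataset.load_from_gcs()
BEGIN
  INSERT INTO my_dataset.target_table (id, payload, load_ts)
  SELECT id, payload, CURRENT_TIMESTAMP()
  FROM my_dataset.gcs_staging_ext;
END;

-- Run it on whatever schedule you need:
CALL my_dataset.load_from_gcs();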
I have a star-schema-like database structure: one fact table holding all the IDs and surrogate keys (skeys), and multiple dimension tables holding the actual ID, code, and descriptions for the IDs referenced in the fact table.
We are moving all these tables (fact and dimensions) to S3 individually, and each table's data is split into multiple Parquet files in S3 (one S3 location per table).
Query: I need to perform a transformation in the cloud, i.e. strip out all the IDs and skeys referenced in the fact table, replace them with the actual codes residing in the dimension tables, create another file, and store the final output back in an S3 location. This file will later be consumed by Redshift for analytics.
My Doubt:
What's the best way to achieve this, given that I don't need the raw data (skeys and IDs) in Redshift, for cost and storage optimization?
Do we need to first combine these split Parquet files into one large file before performing the data transformation? Also, after the data transformation I am planning to save the final output file in Parquet format, but the catch is that Redshift doesn't allow COPY of Parquet files, so is there a workaround for that?
I am not a hardcore programmer and want to avoid using Scala/Python on EMR, but I am good at SQL, so is there a way to perform the data transformation in the cloud through SQL and save the output data into a file or files? Please advise.
You should be able to run Redshift-style queries directly against your S3 Parquet data by using Amazon Athena.
Some information on that:
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
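As a sketch of the kind of SQL-only transformation described above (all database, table, column, and bucket names are hypothetical, and the tables are assumed to be registered in the Glue Data Catalog over the Parquet files), an Athena CTAS can join the fact table to a dimension and write the denormalized result back to S3 as Parquet:

-- Hypothetical names; replaces surrogate keys with dimension codes and
-- writes the result to S3 as Parquet, ready to be queried or loaded later.
CREATE TABLE my_db.fact_denormalized
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/curated/fact_denormalized/'
) AS
SELECT f.order_id,
       p.product_code,
       p.product_description,
       f.order_total
FROM my_db.fact_orders AS f
JOIN my_db.dim_product AS p
  ON f.product_skey = p.product_skey;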