Trouble Partitioning my Amazon Redshift Spectrum Table

Getting this error in particular:
ERROR: Error when calling external catalog API: The number of partition keys do not match the number of partition values

Just posting this because googling that error turned nothing up. This can happen when you ALTER TABLE ADD PARTITION on an external table created without the PARTITIONED BY clause.
Read the CREATE EXTERNAL TABLE docs for details. In retrospect, you, like me, should have read them before trying this command.

In my case I specified only one partition field in the ALTER TABLE ADD PARTITION statement, while in the CREATE EXTERNAL TABLE I had declared:
PARTITIONED BY (year varchar(4), month varchar(2), day varchar(2))
So you are supposed to ADD PARTITION declaring every partition key in the path, e.g.:
ALTER TABLE my_external_schema.my_external_table
ADD PARTITION (year='2022', month='01', day='01')
LOCATION 's3://my_bucket/year=2022/month=01/day=01/';
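For completeness, here is a minimal sketch of the kind of CREATE EXTERNAL TABLE this assumes, so the partition keys line up with the ALTER statement above (the data column and the Parquet format are hypothetical placeholders):
CREATE EXTERNAL TABLE my_external_schema.my_external_table (
  event_id varchar(64)  -- hypothetical data column; use your real columns
)
PARTITIONED BY (year varchar(4), month varchar(2), day varchar(2))
STORED AS PARQUET
LOCATION 's3://my_bucket/';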

Related

Schedule the creation of a partitioned table overwriting an existing table in BigQuery GCP

Yesterday I scheduled a daily overwrite of a table. The new table is supposed to be partitioned, just like the one it overwrites... It did not run at the scheduled time, nor did it give an error... It just did not start.
My feeling is that it has to do with the partitioning option. Note that the casting of the field date_formatted, which will be used as the partition field, works fine.
As far as I know, when scheduling a query you can't use create or replace table T partitioned by column C as select... (see the sketch below).
You start from the select... clause, as shown in the image, and I don't know if the problem comes from there.
PS: I had no trouble scheduling an append to a table partitioned by day with this same procedure.
The destination table is in the same dataset.
If the very same query is scheduled to deliver the results to a table with the same name, but in a different dataset (in the same project), it works.
By the way, the input table and the output table were never the same.
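For reference, this is a minimal sketch of the DDL form referred to above (table and column names are hypothetical); in the scheduled-query UI you normally supply only the SELECT and configure the destination and partitioning in the schedule options:
-- Hypothetical names; date_formatted is assumed to be a DATE column here.
CREATE OR REPLACE TABLE my_dataset.my_table
PARTITION BY date_formatted
AS
SELECT
  CAST(raw_date AS DATE) AS date_formatted,
  some_other_column
FROM my_dataset.source_table;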

BigQuery Hive external partitioning and partition column names

I have created an external partitioned BigQuery table on GCS using the Hive partition strategy with Parquet export. All works fine with the key:value naming on GCS. BQ picks up the partitioned stripes as external columns and shows these columns as added to the table schema.
Let us assume that I have a column called reporting_date on the BQ native table. I cannot name my partition stripe reporting_date=2021-08-31, otherwise BQ does not allow the external table to be created. So I call it reportingdate=2021-08-31.
So far so good. I can use a predicate like
SELECT * FROM <MY_DATASET>.<MYEXTERNAL__TABLE> WHERE reportingdate = '2021-08-31'
This will use partition pruning.
However, I can also do the following:
SELECT * FROM <MY_DATASET>.<MYEXTERNAL__TABLE> WHERE reporting_date = '2021-08-31'
Note that I am using the existing column name, NOT the external partition column name. Surprisingly, this predicate also uses partition pruning. I was wondering how this works, as the external partition is built on reportingdate and NOT reporting_date.
My assumption is that under the bonnet these two columns use the same storage? Somehow BigQuery, at the time of creating the external table, remembers the partition column name. Any ideas or clarification will be appreciated.
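For context, a minimal sketch of the kind of external table definition being described (bucket, dataset, and table names are hypothetical):
CREATE EXTERNAL TABLE my_dataset.my_external_table
WITH PARTITION COLUMNS (
  reportingdate STRING  -- hypothetical; matches the reportingdate=2021-08-31 stripes
)
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my_bucket/exports/*'],
  hive_partition_uri_prefix = 'gs://my_bucket/exports'
);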

Querying S3 using Athena

I have a setup with Kinesis Firehose ingesting data, AWS Lambda performing data transformation, and the incoming data dropped into an S3 bucket. The S3 structure is organized as year/month/day/hour/messages.json, so all of the actual JSON files I am querying are at the 'hour' level, with the year, month, and day directories only containing subdirectories.
My problem is I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all files in its subdirectories, without having to run a query for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02...2020/06/15/23?
I can successfully query the hour-level directories, since I can create a table and define the column names and types represented in my .json file, but I am not sure how to create a table in Athena (if possible) to represent a day directory with subdirectories instead of actual files.
To query only the data for a day without making Athena read all the data for all days you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys Athena can figure out which locations to read and which to skip.
How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose a number of different ways to encode this partitioning in a table, which one is the best depends on what kinds of queries you are going to run. You say you want to query by day, which makes sense, and will work great in this case.
There are two ways to set this up: the traditional way and the new way. The new way uses a feature that was released just a couple of days ago, and if you try to find more examples of it you may not find many, so I'm going to show you the traditional way too.
Using Partition Projection
Use the following SQL to create your table (you have to fill in the columns yourself, since you say you've successfully created a table already just use the columns from that table – also fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
  -- fill in your columns here
)
PARTITIONED BY (
  `date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.date.type" = "date",
  "projection.date.range" = "2020/06/01,NOW",
  "projection.date.format" = "yyyy/MM/dd",
  "projection.date.interval" = "1",
  "projection.date.interval.unit" = "DAYS",
  "storage.location.template" = "s3://cszlos-data/is/here/${date}"
)
This creates a table partitioned by date. Please note that you need to quote this column in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …, since it's a reserved word; if you want to avoid having to quote it, use another name (dt seems popular). Also note that it's escaped with backticks in DDL and with double quotes in DML statements. When you query this table and specify a criterion for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.
The table uses Partition Projection, a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data. Here I'm telling Athena to assume that there is data on S3 from 2020-06-01 up until the time the query runs (adjust the start date as necessary), which means that if you specify a date before that time, or after "now", Athena will know that there is no such data and not even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05', Athena will generate each date (controlled by the projection.date.interval property) and read data in s3://cszlos-data/is/here/2020/06/06, s3://cszlos-data/is/here/2020/06/07, etc.
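For example, predicates against the projected key look like this (assuming the table above; note the double quotes and the yyyy/MM/dd value format):
SELECT count(*)
FROM cszlos_firehose_data
WHERE "date" = '2020/06/05';

SELECT count(*)
FROM cszlos_firehose_data
WHERE "date" BETWEEN '2020/06/05' AND '2020/06/10';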
You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.
The traditional way
The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
  -- fill in your columns here
)
PARTITIONED BY (
  `date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).
You can add partitions in many different ways, but the most straightforward for interactive use is ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:
ALTER TABLE cszlos_firehose_data ADD
  PARTITION (`date` = '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
  PARTITION (`date` = '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
  PARTITION (`date` = '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
  PARTITION (`date` = '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive-style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it here.

Create a table using `pg_table_def` data in Redshift or DBT

To create a table from all the data in pg_table_def that is visible to my user, I tried:
create table adhoc_schema.pg_table_dump as (
  select *
  from pg_table_def
);
But it throws an error:
Column "schemaname" has unsupported type "name"
Any way to create a table from pg_table_def or information_schema.columns?
Found this from another thread, which seems like it would help.
Amazon considers the internal functions that INFORMATION_SCHEMA.COLUMNS uses to be leader-node-only functions. Rather than being sensible and redefining the standardized INFORMATION_SCHEMA.COLUMNS, Amazon sought to define their own proprietary version. For that they made available another function, PG_TABLE_DEF, which seems to address the same need. Pay attention to the note in the center about adding the schema to search_path.
Stores information about table columns.
PG_TABLE_DEF only returns information about tables that are visible to the user. If PG_TABLE_DEF does not return the expected results, verify that the search_path parameter is set correctly to include the relevant schemas.
You can use SVV_TABLE_INFO to view more comprehensive information about a table, including data distribution skew, key distribution skew, table size, and statistics.
So using your example code (rewritten to use NOT EXISTS for clarity),
SET search_path TO '$user', 'public', 'target_schema';

SELECT "column"
FROM dev.fields f
WHERE NOT EXISTS (
  SELECT 1
  FROM pg_table_def pgtd
  WHERE pgtd."column" = f.field
    AND pgtd.schemaname = 'target_schema'
);
See also,
Official docs on Querying Redshift System Tables: https://docs.aws.amazon.com/redshift/latest/dg/t_querying_redshift_system_tables.html
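Since the quoted note mentions SVV_TABLE_INFO, here is a quick example of querying it (a sketch using documented columns; target_schema is the same placeholder as above):
SELECT "schema", "table", size, tbl_rows, skew_rows
FROM svv_table_info
WHERE "schema" = 'target_schema'
ORDER BY size DESC;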
pg_table_def is a leader-node system table with some additional data types not supported in user tables. You will need to cast to text first.
However, this still won't work, because pg_table_def is a leader-node table and the table you are creating is a user table stored on the compute nodes. You will need to pull your source information from tables that are stored on the compute nodes. Since I don't know what information you are looking for from pg_table_def, I cannot say exactly which ones you need, but you can start with stv_tbl_perm and join in pg_class and other tables as more info is needed.
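As a rough starting point only (a sketch that assumes the usual join of stv_tbl_perm.id to pg_class.oid; verify the columns against your cluster before relying on it):
-- Approximate per-table metadata from compute-node-backed system tables.
-- stv_tbl_perm has one row per table per slice, so aggregate the row counts.
SELECT c.oid           AS table_id,
       n.nspname       AS schema_name,
       c.relname       AS table_name,
       SUM(p.rows)     AS approx_rows
FROM stv_tbl_perm p
JOIN pg_class c     ON c.oid = p.id
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY 1, 2, 3;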

Big Query: Query Failed. Dataset Not Found

Guys, I imported a table from a .csv file into my dataset in a project.
When I preview the table it shows up, but whenever I run a query against the table, it always responds with
Query Failed
Error: Not found: Dataset <project-id>:<table-name>. Please verify that the dataset exists and the correct location was used for the job.
Here's my query
SELECT distinct(customer_id) as cust_id FROM [<project-id>:<table-name>.orders] LIMIT 1000
Is there anything wrong?
Or how should I query an imported table?
From your question, I see that you are using <project-id>:<table-name> as the table reference, but as you can see in this documentation page, the correct naming for project-qualified table references is the following:
#legacySQL
[PROJECT_ID:DATASET.TABLE]
#standardSQL
`PROJECT_ID.DATASET.TABLE`
I see you are using legacy SQL (given the square brackets [ ]), so you should go with the first form, but you are missing the dataset name between the project and the table.
Additionally, I see you are appending orders to the table name, but it is not clear what that is, given that you hid the table name as <table-name>.
Additionally, make sure that, if your dataset is not located in the US or EU, you specify the location when running the query, as explained in this entry in the documentation.
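For illustration only, the corrected references might look something like this (assuming orders is the table and <dataset-name> is a placeholder for the dataset you have to fill in; the legacy form uses GROUP BY rather than DISTINCT):

#legacySQL
SELECT customer_id AS cust_id
FROM [<project-id>:<dataset-name>.orders]
GROUP BY cust_id
LIMIT 1000

#standardSQL
SELECT DISTINCT customer_id AS cust_id
FROM `<project-id>.<dataset-name>.orders`
LIMIT 1000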