"Iceberg query cannot be parsed" when trying to create Iceberg table with MAP column data type in Athena? - amazon-athena

According to the Athena Iceberg documentation, the map type is supported.
Why does neither of these statements work?
CREATE TABLE iceberg_test1 (id string, themap map)
LOCATION 's3://mybucket/test/iceberg1'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
Error:
Iceberg query cannot be parsed
Second try:
CREATE TABLE iceberg_test1 (id string, themap map<varchar,varchar>)
LOCATION 's3://mybucket/test/iceberg1'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
Same error:
Iceberg query cannot be parsed

Neither map (without specifying the key and value types) nor varchar is a valid Iceberg type. See the Iceberg documentation for the valid types.
CREATE TABLE iceberg_test1 (id string, themap map<string, string>)
LOCATION 's3://mybucket/test/iceberg1'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
will work as long as you have permissions to access the S3 location.
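As a quick sanity check (a hypothetical follow-up, not part of the original answer), you can insert a row using the MAP constructor and read individual keys back:
INSERT INTO iceberg_test1
VALUES ('row1', MAP(ARRAY['color', 'size'], ARRAY['red', 'xl']));

SELECT id,
       themap['color'] AS color,            -- errors if the key is missing
       element_at(themap, 'size') AS size   -- NULL if the key is missing
FROM iceberg_test1;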

Related

Insert a WHERE conditional in Table creation code for AWS Athena

I need to create a table with a specific condition that can be updated when the bucket is updated. This is an example:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`cards-test` (
`id` bigint,
`created_at` timestamp,
`type` string,
`account_id` bigint,
`last_4_digits` string,
`is_active` boolean,
`status` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://something/cards-bucket/'
TBLPROPERTIES ('classification' = 'parquet');
Now, let's say I want a WHERE clause that says WHERE type = 'type_1', can I insert this here? If so, where?
If not, how should I create a table with such specific conditions out of the buckets?
No. As the docs show, the CREATE TABLE syntax has no option for filtering the data.
What you can do is create another table via the CREATE TABLE AS syntax with the filter applied:
CREATE TABLE "cards-test-type_1" WITH (
...
) AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'
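For completeness, a filled-in CTAS sketch; the WITH properties shown here (format, external_location), the underscore table name, and the target prefix are illustrative choices rather than part of the original answer, and external_location must point to an empty prefix:
CREATE TABLE cards_test_type_1
WITH (
  format = 'PARQUET',                                          -- output file format
  external_location = 's3://something/cards-bucket-type-1/'    -- hypothetical empty prefix
) AS
SELECT *
FROM "default"."cards-test"
WHERE type = 'type_1'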
Or create a view:
Creates a new view from a specified SELECT query. The view is a logical table that can be referenced by future queries. Views do not contain any data and do not write data. Instead, the query specified by the view runs each time you reference the view by another query.
CREATE VIEW "cards-test-type_1" AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'

Read partitioned data of VPC flow logs

I used this article to read my VPC flow logs and everything worked correctly.
https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/
But when I follow the official documentation and run the create table statement below, the table does not return any records.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string,
`vpc_id` string,
`subnet_id` string,
`instance_id` string,
`tcp_flags` int,
`type` string,
`pkt_srcaddr` string,
`pkt_dstaddr` string,
`region` string,
`az_id` string,
`sublocation_type` string,
`sublocation_id` string,
`pkt_src_aws_service` string,
`pkt_dst_aws_service` string,
`flow_direction` string,
`traffic_path` int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` string,
`month` string,
`day` string,
`hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id={account_id}/aws-service=vpcflowlogs/aws-region={region_code}/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)
official doc:
https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html
This create table statement should work after changing the variables like DOC-EXAMPLE-BUCKET/prefix, account_id, and region_code. Why am I getting 0 rows returned for a select * query?
You need to load the partitions manually before you can use them.
From the docs:
After you create the table, you load the data in the partitions for querying. For Hive-compatible data, you run MSCK REPAIR TABLE. For non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.
So if your structure is Hive compatible, you can just run:
MSCK REPAIR TABLE `table name`;
And this will load all your new partitions.
Otherwise you'll have to load them manually using ADD PARTITION:
ALTER TABLE vpc_flow_logs ADD PARTITION (`aws-account-id`='1', `aws-service`='2', ...) LOCATION 's3://bucket/subfolder/data/accountid1/service2/'
Because manually adding partitions is tedious when your data structure is not Hive compatible, I recommend using partition projection for your table.
To avoid having to manage partitions, you can use partition projection. Partition projection is an option for highly partitioned tables whose structure is known in advance. In partition projection, partition values and locations are calculated from table properties that you configure rather than read from a metadata repository. Because the in-memory calculations are faster than remote look-up, the use of partition projection can significantly reduce query runtimes.
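If you go the partition projection route, the configuration is just a set of table properties. The sketch below is illustrative only (not from the original answer): the account ID, regions, year range, and storage.location.template must be adapted to your actual flow log delivery layout.
ALTER TABLE vpc_flow_logs SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- hyphenated partition columns projected as enums; list your real values
  'projection.aws-account-id.type' = 'enum',
  'projection.aws-account-id.values' = '123456789012',
  'projection.aws-service.type' = 'enum',
  'projection.aws-service.values' = 'vpcflowlogs',
  'projection.aws-region.type' = 'enum',
  'projection.aws-region.values' = 'us-east-1,eu-west-1',
  -- date/time partition columns projected as zero-padded integers
  'projection.year.type' = 'integer',
  'projection.year.range' = '2021,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  -- must mirror the S3 key layout your flow logs are delivered to
  'storage.location.template' = 's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id=${aws-account-id}/aws-service=${aws-service}/aws-region=${aws-region}/year=${year}/month=${month}/day=${day}/hour=${hour}'
)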

Amazon Athena Error Creating Partitioned Tables

I am new to Athena and would appreciate some help.
I have multiple CSV files in the following format. Please note that all fields are in double quotes, and the total file size is about 5 GB. If possible, I would rather do this without using Glue, unless there is a reason to spend money on running the crawlers.
"emailusername.string()","emaildomain.string()","name.string()","details.string()"
"myname1","website1.com","fullname1","address1 n details"
"myname2","website2.com","fullname2","address2 n details"
The following code on Athena works perfectly:
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`emailusername` string,
`emaildomain` string,
`name` string,
`details` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, I am not able to use either clustering or partitioning. The following code runs successfully, and after that I am also able to load partitions successfully, but no data is returned!
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`name` string,
`details` string
)
PARTITIONED BY (emaildomain string, emailusername string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
MSCK REPAIR TABLE tablea;
SELECT * FROM "db1"."tablea";
Result: Zero records returned
If your intention is to create partitions on emaildomain and emailusername:
You don't need fields called emaildomain and emailusername in the table itself. However, you do need the two directory levels, such as domain1/user1, under your S3 location,
e.g. s3://projectzzzz2/0001_aaaa_delme/domain1/user1
Make sure you copy your file into the partition directory s3://projectzzzz2/0001_aaaa_delme/domain1/user1 (not into the table root s3://projectzzzz2/0001_aaaa_delme), since the ADD PARTITION statement below points at that directory.
Then you can issue:
ALTER TABLE tablea ADD PARTITION (emaildomain='domain1', emailusername='user1') LOCATION 's3://projectzzzz2/0001_aaaa_delme/domain1/user1';
If you query the table tablea, you will see that the new fields emaildomain and emailusername have been added automatically.
To my knowledge, whenever you add a new user or a new email domain, you need to copy your file into the new folder and issue the corresponding ALTER TABLE statement.
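As an alternative to issuing one ALTER TABLE per partition, if you are free to choose the directory names you can lay the data out in Hive-style key=value folders so that MSCK REPAIR TABLE discovers every partition in one go. A sketch (the domain and user values below are placeholders):
-- data laid out as:
--   s3://projectzzzz2/0001_aaaa_delme/emaildomain=website1.com/emailusername=myname1/file1.csv
--   s3://projectzzzz2/0001_aaaa_delme/emaildomain=website2.com/emailusername=myname2/file2.csv

MSCK REPAIR TABLE tablea;

SELECT * FROM "db1"."tablea" WHERE emaildomain = 'website1.com';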

How to load key-value pairs (MAP) into Athena from Parquet file?

I have an S3 bucket full of .gz.parquet files that I want to make accessible in Athena. In order to do this, I am creating a table in Athena that points at the S3 bucket:
CREATE EXTERNAL TABLE user_db.table (
pan_id bigint,
dev_id bigint,
parameters ?????,
start_time_local bigint
)
STORED AS PARQUET
LOCATION 's3://bucket/path/to/folder/containing_files/'
tblproperties ("parquet.compression"="GZIP")
;
How do I correctly specify the data type for the parameters column?
Using # parquet-tools schema, I see the following schema of the data files:
optional int64 pan_id;
optional int64 dev_id;
optional group parameters (MAP) {
repeated group key_value {
required binary key (UTF8);
optional binary value (UTF8);
}
}
optional int96 start_time_local;
Using # parquet-tools head, I see the following value for one row of data:
pan_id = 1668490
dev_id = 6843371
parameters:
.key_value:
..key = doc_id
..value = c2bd3593d7015fb912d4de229a302379babcf6a00a203fcf
.key_value:
..key = variables
..value = {"video_id":"2313675068886132","surface":"post"}
start_time_local = QFOHvvYvAAAzhCUA
I appreciate any help you can give. I have not been able to find good documentation for the MAP datatype being used in CREATE TABLE.
Maps are declared as map<string,string> (for string-to-string maps; other key and value types are also possible). In your case the whole table DDL would be:
CREATE EXTERNAL TABLE user_db.table (
pan_id bigint,
dev_id bigint,
parameters map<string,string>,
start_time_local bigint
)
STORED AS PARQUET
LOCATION 's3://bucket/path/to/folder/containing_files/'
tblproperties ("parquet.compression" = "GZIP")
The map type is second to last in the list of Athena data types.
You can use an AWS Glue Crawler to automatically derive the schema from your Parquet files.
Defining AWS Glue Crawlers: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
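Once the table is created, the map column can be queried with element access, and the JSON string stored under the variables key can be parsed with json_extract_scalar. A sketch using the key names from the sample row above (note the double quotes around the table name, since table is a reserved word):
SELECT
  pan_id,
  parameters['doc_id'] AS doc_id,                           -- errors if the key is missing
  element_at(parameters, 'variables') AS variables_json,    -- NULL if the key is missing
  json_extract_scalar(element_at(parameters, 'variables'), '$.video_id') AS video_id
FROM user_db."table"
LIMIT 10;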

Reading Json data from Athena

I have created a table by mapping the JSON data; unfortunately, I am not able to read the nested array within the JSON.
{
"total":10,
"count":100,
"values":{
"source":[{"sourceid":"10001","source":"ABC"},
{"sourceid":"10002","source":"XYZ"}
]}
}
Athena table:
CREATE EXTERNAL TABLE source_master_data(
total bigint,
count bigint,
values struct<source: array<struct<sourceid: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/'
I am trying to read the sourceid and source but with no luck. Can anyone help me out? This is the query I tried:
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data.Values) AS t1
The UNNEST needs to be applied to the array type. In your query, you are trying to unnest the struct, which is not possible in Athena.
The second issue is the use of values without quotes. This also fails because values is a reserved word in Athena.
The overall query would look something like this.
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)
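If you also need the source value itself, note that the struct in the CREATE TABLE above only declares sourceid. A sketch of the adjusted column type and query, assuming you recreate the table with the extra field:
-- column definition with the extra field added:
-- `values` struct<source: array<struct<sourceid: string, source: string>>>

SELECT t1.source.sourceid,
       t1.source.source
FROM source_master_data
CROSS JOIN UNNEST(source_master_data."values".source) AS t1 (source)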