How to load key-value pairs (MAP) into Athena from Parquet file?

I have an S3 bucket full of .gz.parquet files. I want to make them accessible in Athena. To do this I am creating a table in Athena that points at the S3 bucket:
CREATE EXTERNAL TABLE user_db.table (
pan_id bigint,
dev_id bigint,
parameters ?????,
start_time_local bigint
)
STORED AS PARQUET
LOCATION 's3://bucket/path/to/folder/containing_files/'
tblproperties ("parquet.compression"="GZIP")
;
How do I correctly specify the data type for the parameters column?
Using # parquet-tools schema, I see the following schema for the data files:
optional int64 pan_id;
optional int64 dev_id;
optional group parameters (MAP) {
repeated group key_value {
required binary key (UTF8);
optional binary value (UTF8);
}
}
optional int96 start_time_local;
Using # parquet-tools head, I see the following value for one row of data:
pan_id = 1668490
dev_id = 6843371
parameters:
.key_value:
..key = doc_id
..value = c2bd3593d7015fb912d4de229a302379babcf6a00a203fcf
.key_value:
..key = variables
..value = {"video_id":"2313675068886132","surface":"post"}
start_time_local = QFOHvvYvAAAzhCUA
I appreciate any help you can give. I have not been able to find good documentation for using the MAP datatype in CREATE TABLE.

Maps are declared as map<string,string> (for string-to-string maps; other key and value types are also possible). In your case the whole table DDL would be:
CREATE EXTERNAL TABLE user_db.table (
pan_id bigint,
dev_id bigint,
parameters map<string,string>,
start_time_local bigint
)
STORED AS PARQUET
LOCATION 's3://bucket/path/to/folder/containing_files/'
tblproperties ("parquet.compression" = "GZIP")
The map type is second to last in the list of Athena data types.
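Once the table exists, values can be looked up by key directly in queries. A minimal sketch (column and key names taken from the question; "table" is quoted because it is a reserved word):
-- element_at returns NULL for a missing key,
-- while parameters['doc_id'] raises an error if the key is absent.
SELECT
pan_id,
parameters['doc_id'] AS doc_id,
element_at(parameters, 'variables') AS variables
FROM user_db."table"
LIMIT 10;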

You can use an AWS Glue Crawler to automatically derive the schema from your Parquet files.
Defining AWS Glue Crawlers: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html

Related

Insert a WHERE conditional in Table creation code for AWS Athena

I need to create a table with a specific filter condition that stays current as the bucket is updated. This is an example:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`cards-test` (
`id` bigint,
`created_at` timestamp,
`type` string,
`account_id` bigint,
`last_4_digits` string,
`is_active` boolean,
`status` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://something/cards-bucket/'
TBLPROPERTIES ('classification' = 'parquet');
Now, let's say I want a WHERE clause that says WHERE type = 'type_1'. Can I insert this here? If so, where?
If not, how should I create a table with such specific conditions out of the buckets?
No. As the docs show for the CREATE TABLE syntax, there is no option to filter the data.
What you can do is create another table via the CREATE TABLE AS syntax with the filter applied:
CREATE TABLE "cards-test-type_1" WITH (
...
) AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'
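For reference, a minimal sketch of what the WITH block might contain (the output location is hypothetical, and an underscore name sidesteps the quoting that hyphenated names require):
CREATE TABLE cards_test_type_1 WITH (
format = 'PARQUET', -- storage format for the new table's files
external_location = 's3://something/cards-test-type-1/' -- hypothetical output path
) AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'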
Or create a view:
Creates a new view from a specified SELECT query. The view is a logical table that can be referenced by future queries. Views do not contain any data and do not write data. Instead, the query specified by the view runs each time you reference the view by another query.
CREATE VIEW "cards-test-type_1" AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'
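The view can then be queried like any other table, with the filter applied on every read:
SELECT count(*)
FROM "cards-test-type_1"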

How to export hive table data into csv.gz format stored in s3

So I have two Hive queries: one that creates the table, and one that reads Parquet data from another table and inserts the relevant columns into my new table. I would like this new Hive table to export its data to an S3 location in csv.gz format. My Hive queries running on EMR currently output files named 00000_0.gz, and I have to rename them to csv.gz using a bash script. This is quite hacky, as I have to mount my S3 directory in my terminal; it would be ideal if the queries could do this directly. Could someone please review my queries and see if there's any fault? Many thanks.
CREATE TABLE db.test (
app_id string,
app_account_id string,
sdk_ts BIGINT,
device_id string)
PARTITIONED BY (
load_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION "s3://test_unload/";
set hive.execution.engine=tez;
set hive.cli.print.header=true;
set hive.exec.compress.output=true;
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1024000000;
set hive.merge.size.per.task=1024000000;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into db.test
partition(load_date)
select
'' as app_id,
'288' as app_account_id,
from_unixtime(CAST(event_epoch as BIGINT), 'yyyy-MM-dd HH:mm:ss') as sdk_ts,
device_id,
'20221106' as load_date
FROM processed_events.test
where load_date = '20221106';

AWS Athena create external table succeeds even if AWS s3 doesn't have file in it?

create external table reason (
reason_id int,
retailer_id int,
reason_code string,
reason_text string,
ordering int,
creation_date date,
is_active tinyint,
last_updated_by int,
update_date date
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE
location 's3://bucket_name/athena-workspace/athena-input/'
TBLPROPERTIES ("skip.header.line.count"="1");
The query above executes successfully; however, there are no files in the provided location!
Upon successful execution the table is created and is empty. How is this possible?
Even if I upload a file to the provided location, the created table is still empty!
Athena is not a data store, it is simply a serverless tool to read data in S3 using SQL like expressions.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
This query creates the table's metadata; it doesn't write to that location, it reads from it.
If you put a CSV into the location and ran select * from reason, Athena would attempt to map every file under the prefix athena-workspace/athena-input/ in bucket bucket_name to your schema, using the ROW FORMAT and SERDEPROPERTIES to parse the files. It would also skip the first line of each file, assuming it is a header.
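For example, once a matching file exists under the prefix, a plain SELECT returns its rows. A small sketch (the file name is made up; Athena reads every object under the prefix):
-- Assumes s3://bucket_name/athena-workspace/athena-input/reasons.tsv exists,
-- tab-separated to match the SERDEPROPERTIES, with a header as its first line.
SELECT reason_id, reason_code, reason_text
FROM reason
LIMIT 10;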

Reading Json data from Athena

I have created a table by mapping the JSON data; unfortunately, I am not able to read the nested array within the JSON.
{
"total":10,
"count":100,
"values":{
"source":[{"sourceid":"10001","source":"ABC"},
{"sourceid":"10002","source":"XYZ"}
]}
}
Athena table:
CREATE EXTERNAL TABLE source_master_data(
total bigint,
count bigint,
values struct<source: array<struct<sourceid: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/'
I am trying to read the sourceid and source but no luck. Can anyone help me out?
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data.Values) AS t1
The UNNEST needs to be placed on the array type. In your query you are trying to unnest the struct, which is not possible in Athena.
The second issue is the use of values without quotes. This also fails, because values is a reserved word in Athena.
The overall query would look something like this:
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)
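Note that the struct in the question's DDL only declares sourceid, so the source field is not reachable until the column definition includes it. A sketch of the extended DDL and query (same data, one extra struct field):
CREATE EXTERNAL TABLE source_master_data (
total bigint,
count bigint,
`values` struct<source: array<struct<sourceid: string, source: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/';
-- t1.source is one element of the unnested array
SELECT t1.source.sourceid, t1.source.source
FROM source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)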

Amazon Athena : How to store results after querying with skipping column headers?

I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena includes the column headers in S3. How can I skip storing the header column names, as I have to make a new table from the results and it is repetitive?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Run an Athena query, with the output going to Amazon S3
Created a new table pointing to this output based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
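Put together, the final step looks something like this (the table and date column names are placeholders for whatever the new table defines):
-- query_results points at the S3 output of the earlier Athena query;
-- rows whose date field begins with '#' are filtered out.
SELECT *
FROM query_results
WHERE event_date NOT LIKE '#%';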
However, each subsequent query stores even more data in that S3 directory, which confuses later executions.