Athena Produces Strange Results When Working With AWS Glue Table - amazon-web-services

I have recently started exploring some ETL solutions on AWS and came across AWS Glue. So far it has been a major time sink with no real results, but hopefully that's because of my lack of experience with it.
I have plenty of data in an S3 bucket, all residing in .json files. Each file contains a single array of objects. I created a crawler with a custom JSON classifier ($[*]) to go through them. When I run the crawler, the table gets created/updated.
Here's where my problem begins: when I query this data through Athena, I get the following issue.
Athena puts the whole JSON object inside every single column of my table definition. I've searched everywhere and tried different serializers in the table definition, and nothing seems to help. I've read that Athena requires each JSON object to be on a single line, but I can't control how AWS Glue packs the data in a table.
If anyone has encountered similar issues or knows what this could be, I would be eternally grateful. Thanks!
Edit 1:
As requested in the comments, here's what SHOW CREATE TABLE looks like for the table:
CREATE EXTERNAL TABLE `calcal`(
`id` string COMMENT 'from deserializer',
`status` string COMMENT 'from deserializer',
`active` string COMMENT 'from deserializer',
`title` string COMMENT 'from deserializer',
`minimum_value` string COMMENT 'from deserializer',
`maximum_value` string COMMENT 'from deserializer',
`value_operator` string COMMENT 'from deserializer',
`value_notes` string COMMENT 'from deserializer',
`sector` string COMMENT 'from deserializer',
`construction_type` string COMMENT 'from deserializer',
`early_planning_and_development_date` string COMMENT 'from deserializer',
`bidding_date` string COMMENT 'from deserializer',
`pre_construction_date` string COMMENT 'from deserializer',
`construction_date` string COMMENT 'from deserializer',
`project_schedule` string COMMENT 'from deserializer',
`bid_due_date` string COMMENT 'from deserializer',
`bidding_information` string COMMENT 'from deserializer',
`start_date` string COMMENT 'from deserializer',
`start_date_notes` string COMMENT 'from deserializer',
`end_date` string COMMENT 'from deserializer',
`end_date_notes` string COMMENT 'from deserializer',
`address1` string COMMENT 'from deserializer',
`address2` string COMMENT 'from deserializer',
`city` string COMMENT 'from deserializer',
`state` string COMMENT 'from deserializer',
`zip` string COMMENT 'from deserializer',
`county` string COMMENT 'from deserializer',
`other_info` string COMMENT 'from deserializer',
`details` string COMMENT 'from deserializer',
`update_notes` string COMMENT 'from deserializer',
`material_classification` string COMMENT 'from deserializer',
`submitted_by` string COMMENT 'from deserializer',
`created_at` string COMMENT 'from deserializer',
`updated_at` string COMMENT 'from deserializer',
`deleted_at` string COMMENT 'from deserializer',
`types` array<struct<id:string,name:string,code:string,pivot:struct<project_id:string,project_type_id:string>>> COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`stage` struct<id:string,name:string> COMMENT 'from deserializer')
PARTITIONED BY (
`partition_0` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='active,address1,address2,bid_due_date,bidding_date,bidding_information,city,construction_date,construction_type,county,created_at,deleted_at,details,early_planning_and_development_date,end_date,end_date_notes,id,material_classification,maximum_value,minimum_value,other_info,pre_construction_date,project_schedule,sector,stage,start_date,start_date_notes,state,status,submitted_by,title,types,update_notes,updated_at,user,value_notes,value_operator,zip')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://dev-smart-dw/Construction/CAL/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='CAL-Test-2',
'averageRecordSize'='1261',
'classification'='json',
'compressionType'='none',
'jsonPath'='$[*]',
'objectCount'='36',
'recordCount'='1800',
'sizeKey'='2285975',
'typeOfData'='file')
The S3 data is just a bunch of JSON files, each containing a single array with 50 of these objects. There is one array per file, and the files are split across multiple folders.
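If the files can be rewritten before they are crawled, one workaround consistent with the single-line requirement mentioned above is to convert each array file into one JSON object per line. The following is a minimal sketch, assuming each file fits in memory; array_to_json_lines is a hypothetical helper name, and reading from or writing back to S3 is left out.

```python
import json


def array_to_json_lines(text: str) -> str:
    """Turn a JSON document holding one top-level array into one compact object per line."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Without an indent argument json.dumps never emits raw newlines,
    # so every record ends up on exactly one line.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


# Hypothetical two-record file in the same shape as the crawled data.
sample = """[
  {"id": "1", "stage": {"id": "7", "name": "Bidding"}},
  {"id": "2", "stage": null}
]"""

print(array_to_json_lines(sample), end="")
```

Each output line is then a complete JSON object, which is the layout the JSON SerDe reads row by row.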

Related

How to transform data in Amazon Athena

I have some data in an S3 location in JSON format. It has four columns: val, time__stamp, name and type. I would like to create an external Athena table from this data with the transformations given below:
timestamp: the timestamp should be converted from Unix epoch to UTC; I did this by using the timestamp data type.
name: name should be filtered with the following SQL logic:
name not in ('abc','cdf','fgh') and name not like '%operator%'
type: type should not have values labeled as counter
I would also like to add two partition columns, date and hour, derived from the time__stamp column.
I started with the following:
CREATE EXTERNAL TABLE `airflow_cluster_data`(
`val` string COMMENT 'from deserializer',
`time__stamp` timestamp COMMENT 'from deserializer',
`name` string COMMENT 'from deserializer',
`type` string COMMENT 'from deserializer')
PARTITIONED BY (
date,
hour)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.time_stamp'='#timestamp')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket1/raw/airflow_data'
I tried various things but couldn't figure out the syntax. Using Spark might have been easier, but I don't want to run Amazon EMR every hour for a small data set. I would prefer to do it in Athena if possible.
Please have a look at some sample data:
1533,1636674330000,abc,counter
1533,1636674330000,xyz,timer
1,1636674330000,cde,counter
41,1636674330000,cde,timer
1,1636674330000,fgh,counter
231,1636674330000,xyz,timer
1,1636674330000,abc,counter
2431,1636674330000,cde,counter
42,1636674330000,efg,timer
Probably the simplest method is to create a View:
CREATE VIEW foo AS
SELECT
val,
cast(from_unixtime(time__stamp / 1000) as timestamp) as timestamp,
cast(from_unixtime(time__stamp / 1000) as date) as date,
hour(cast(from_unixtime(time__stamp / 1000) as timestamp)) as hour,
name,
type
FROM airflow_cluster_data
WHERE name not in ('abc','cdf','fgh')
AND name not like '%operator%'
AND type != 'counter'
You can also create your own UDF for the transformation and use it in Athena: https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html
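To sanity-check what the view should return for the sample rows, the same logic can be mirrored outside Athena. The sketch below is only an illustration in Python, not part of the Athena solution. The names transform, EXCLUDED_NAMES and SAMPLE are hypothetical, and it assumes time__stamp holds epoch milliseconds, as in the sample data.

```python
from datetime import datetime, timezone

EXCLUDED_NAMES = {"abc", "cdf", "fgh"}

# A few rows copied from the sample data: (val, time__stamp, name, type).
SAMPLE = [
    ("1533", 1636674330000, "abc", "counter"),
    ("1533", 1636674330000, "xyz", "timer"),
    ("1", 1636674330000, "cde", "counter"),
    ("41", 1636674330000, "cde", "timer"),
    ("231", 1636674330000, "xyz", "timer"),
    ("42", 1636674330000, "efg", "timer"),
]


def transform(row):
    """Return the derived row, or None when the view's WHERE clause would drop it."""
    val, epoch_ms, name, type_ = row
    if name in EXCLUDED_NAMES or "operator" in name or type_ == "counter":
        return None
    # Epoch milliseconds -> timezone-aware UTC datetime.
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return {
        "val": val,
        "timestamp": ts,
        "date": ts.date(),
        "hour": ts.hour,
        "name": name,
        "type": type_,
    }


for row in SAMPLE:
    result = transform(row)
    if result is not None:
        print(result["val"], result["date"], result["hour"], result["name"])
```

For these rows only the timer entries named xyz, cde and efg survive, and 1636674330000 falls on 2021-11-11 in hour 23 UTC.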

Ignore malformed JSON in AWS Athena query

I am trying to get the data of Bogota from OpenAQ in AWS Athena, using the following query.
SELECT *
FROM openaq
WHERE city='Bogota'
I get the following error, which refers to malformed JSON:
Row is not a valid JSON Object - JSONException: Unterminated string at 201 [character 202 line 1]
Is there a way to ignore the rows with malformed JSONs in the query?
I have tried adjusting the table template from this source using this line of code as per this thread.
'ignore.malformed.json'='true'
So the new table template query reads:
CREATE EXTERNAL TABLE `openaq`(
`date` struct<utc:string,local:string> COMMENT 'from deserializer',
`parameter` string COMMENT 'from deserializer',
`location` string COMMENT 'from deserializer',
`value` float COMMENT 'from deserializer',
`unit` string COMMENT 'from deserializer',
`city` string COMMENT 'from deserializer',
`attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer',
`averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer',
`coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer',
`country` string COMMENT 'from deserializer',
`sourcename` string COMMENT 'from deserializer',
`sourcetype` string COMMENT 'from deserializer',
`mobile` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json'='true')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://openaq-fetches/realtime-gzipped'
TBLPROPERTIES (
'transient_lastDdlTime'='1518373755')
This seems to solve the error for me.
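When the source files are under your control, it can also help to locate the offending rows instead of only skipping them. The following is a minimal sketch, assuming the data is available locally as newline-delimited JSON; find_malformed_lines and SAMPLE_LINES are hypothetical names, and this check runs outside Athena.

```python
import json


def find_malformed_lines(lines):
    """Yield (line_number, reason) for each non-empty line that is not a JSON object."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, exc.msg
            continue
        if not isinstance(value, dict):
            yield number, "not a JSON object"


# Hypothetical lines; the second one has an unterminated string.
SAMPLE_LINES = [
    '{"city": "Bogota", "value": 12.5}',
    '{"city": "Bogo',
    "",
    "[1, 2]",
]

for number, reason in find_malformed_lines(SAMPLE_LINES):
    print(number, reason)
```

The function accepts any iterable of strings, so an open text file handle can be passed in directly.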

Athena returning zero results

Hi I am creating a table like -
CREATE EXTERNAL TABLE `historyrecordjson`(
`last_name` string COMMENT 'from deserializer',
`first_name` string COMMENT 'from deserializer',
`email` string COMMENT 'from deserializer',
`country` string COMMENT 'from deserializer',
`city` string COMMENT 'from deserializer',
`event_time` bigint COMMENT 'from deserializer'
)
PARTITIONED BY (
`account_id` string,
`year` string,
`month` string,
`day` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://aguptahistoryrecordcopy/recordshistoryjson/'
TBLPROPERTIES (
'projection.account_id.type'='injected',
'projection.day.range'='01,31',
'projection.day.type'='integer',
'projection.enabled'='true',
'projection.month.range'='01,12',
'projection.month.type'='integer',
'projection.year.range'='2020,3000',
'projection.year.type'='integer',
'storage.location.template'='s3://aguptahistoryrecordcopy/historyrecordjson/${account_id}/${year}/${month}/${day}')
When I run the query below, it returns zero records:
SELECT * FROM "historyrecordjson" where account_id='acc-1234' AND year= '2021' AND month= '1' AND day='1' limit 10 ;
My S3 directory looks like this:
s3://aguptahistoryrecordcopy/historyrecordjson/account_id=acc-1234/year=2021/month=1/day=1/1b339139-326c-432f-90aa-15bf30f37be2.json
I can see that the partition is loaded as:
account_id=acc-1234/year=2021/month=1/day=1
I am not sure what I am missing. The query result shows Data scanned: 0 KB.
The DDL you are using is for a delimited text file, whereas your actual data in S3 is JSON. Refer to https://github.com/rcongiu/Hive-JSON-Serde and create the table with the correct SerDe and a definition for JSON data.
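Separately from the SerDe mismatch, the Data scanned: 0 KB figure suggests it is worth checking where the projected partitions point. The sketch below expands the storage.location.template from the DDL and compares it with the actual S3 prefix from the question. It assumes that Athena substitutes partition values literally into the ${...} placeholders; resolve_location is a hypothetical helper, and Python's string.Template is used only because it happens to share the same placeholder syntax.

```python
from string import Template

BUCKET_PREFIX = "s3://aguptahistoryrecordcopy/historyrecordjson/"

# Template copied from the table's TBLPROPERTIES.
DDL_TEMPLATE = BUCKET_PREFIX + "${account_id}/${year}/${month}/${day}"
# Hypothetical alternative that spells out the key=value folder names.
HIVE_STYLE_TEMPLATE = (
    BUCKET_PREFIX + "account_id=${account_id}/year=${year}/month=${month}/day=${day}"
)
# Prefix of the object listed in the question.
ACTUAL_PREFIX = BUCKET_PREFIX + "account_id=acc-1234/year=2021/month=1/day=1"


def resolve_location(template: str, **partitions: str) -> str:
    """Fill the ${name} placeholders with the partition values used in the query."""
    return Template(template).substitute(**partitions)


values = {"account_id": "acc-1234", "year": "2021", "month": "1", "day": "1"}

print(resolve_location(DDL_TEMPLATE, **values))
print(resolve_location(DDL_TEMPLATE, **values) == ACTUAL_PREFIX)         # False
print(resolve_location(HIVE_STYLE_TEMPLATE, **values) == ACTUAL_PREFIX)  # True
```

Under that assumption the template in the DDL resolves to a prefix without the account_id=, year=, month= and day= folder names, so it would not match the layout shown in the question.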

Athena displays special characters as ?

I have an external table with below DDL
CREATE EXTERNAL TABLE `table_1`(
`name` string COMMENT 'from deserializer',
`desc1` string COMMENT 'from deserializer',
`desc2` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'='|',
'skip.header.line.count'='1')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://temp_loc/temp_csv/'
TBLPROPERTIES (
'classification'='csv',
'compressionType'='none',
'typeOfData'='file')
The CSV files that this table reads are UTF-16 LE encoded. When rendering the output in Athena, the special characters are displayed as question marks. Is there any way to set the encoding in Athena, or otherwise fix this?
The solution, as Philipp Johannis mentions in a comment, is to set the serialization.encoding table property to "UTF-16LE". As far as I can see LazySimpleSerde uses java.nio.charset.Charset.forName, so any encoding/charset name accepted by Java should work.
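A different approach, not the one from the answer above, is to transcode the files to UTF-8 before Athena reads them, so that no encoding property is needed. The following is a minimal sketch, assuming the files are small enough to hold in memory; utf16le_to_utf8 is a hypothetical helper name, and downloading from and uploading to S3 is left out.

```python
def utf16le_to_utf8(raw: bytes) -> bytes:
    """Decode UTF-16 LE bytes and re-encode them as UTF-8, dropping a leading BOM."""
    text = raw.decode("utf-16-le")
    # The "utf-16-le" codec keeps a byte order mark as U+FEFF, so strip it explicitly.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


# Hypothetical pipe-separated content with a BOM, shaped like the table's columns.
original = b"\xff\xfe" + 'name|desc1|desc2\n"José"|"naïve"|"10 €"\n'.encode("utf-16-le")

print(utf16le_to_utf8(original).decode("utf-8"), end="")
```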

Athena Issue - query string in lowercase

I have a table that contains a JSON column, column_A.
Instead of defining column_A as a struct, I defined it as a string so that I can query the JSON.
The problem is that when I query column_A, I receive the data in lowercase.
CREATE EXTERNAL TABLE `table_test`(
`userid` string COMMENT 'from deserializer',
`column_a` string COMMENT 'from deserializer',
`createdat` string COMMENT 'from deserializer',
`updatedat` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='column_a,userId')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test-stackoverflow/'
TBLPROPERTIES (
'classification'='json',
'transient_lastDdlTime'='1567353697')
I understand that this is related to the SerDe, but I don't know what I should change.
What can I do to solve it?
The reasons I don't define column_A as a struct are:
1. The keys change every time, and as far as I know, I need to list the keys when I define the struct.
2. I have an empty string as a key, and I got errors when querying a struct with "" as a key.
Thanks.
Found the solution.
The problem is with the SerDe:
https://docs.aws.amazon.com/athena/latest/ug/json.html
I needed to add 'case.insensitive'= "FALSE" to the SerDe properties.
CREATE EXTERNAL TABLE `table_test`(
`userid` string COMMENT 'from deserializer',
`column_a` string COMMENT 'from deserializer',
`createdat` string COMMENT 'from deserializer',
`updatedat` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'case.insensitive'= "FALSE",
'paths'='column_a,userId')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test-stackoverflow/'
TBLPROPERTIES (
'classification'='json',
'transient_lastDdlTime'='1567353697')
Hope this will help someone.