Athena displays special characters as "?"

I have an external table with the DDL below:
CREATE EXTERNAL TABLE `table_1`(
`name` string COMMENT 'from deserializer',
`desc1` string COMMENT 'from deserializer',
`desc2` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'='|',
'skip.header.line.count'='1')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://temp_loc/temp_csv/'
TBLPROPERTIES (
'classification'='csv',
'compressionType'='none',
'typeOfData'='file')
The CSV files that this table reads are UTF-16 LE encoded. When I try to render the output in Athena, the special characters are displayed as question marks. Is there any way to set the encoding in Athena, or otherwise fix this?

The solution, as Philipp Johannis mentions in a comment, is to set the serialization.encoding table property to "UTF-16LE". As far as I can see, LazySimpleSerDe uses java.nio.charset.Charset.forName, so any encoding/charset name accepted by Java should work.
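For example, the property can be attached to the existing table with an ALTER TABLE statement (a minimal sketch; as noted above the encoding handling is done by LazySimpleSerDe, so behaviour with OpenCSVSerde may differ):
-- sketch: table name from the question; any Java charset name should be accepted
ALTER TABLE table_1 SET TBLPROPERTIES ('serialization.encoding'='UTF-16LE')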

Related

How to transform data in Amazon Athena

I have some data in an S3 location in JSON format. It has 4 columns: val, time__stamp, name, and type. I would like to create an external Athena table from this data with the transformations given below:
timestamp: the timestamp should be converted from Unix epoch to UTC; I did this by using the timestamp data type.
name: name should be filtered with the following SQL logic:
name not in ('abc','cdf','fgh') and name not like '%operator%'
type: type should not have values labeled as counter.
I would also like to add two partition columns, date and hour, derived from the time__stamp column.
I started with the following:
CREATE EXTERNAL TABLE `airflow_cluster_data`(
`val` string COMMENT 'from deserializer',
`time__stamp` timestamp COMMENT 'from deserializer',
`name` string COMMENT 'from deserializer',
`type` string COMMENT 'from deserializer')
PARTITIONED BY (
date,
hour)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.time_stamp'='#timestamp')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket1/raw/airflow_data'
I tried various things but couldn't figure out the syntax. Using Spark would have been easier, but I don't want to run Amazon EMR every hour for a small data set. I would prefer to do it in Athena if possible.
Please have a look at some sample data:
1533,1636674330000,abc,counter
1533,1636674330000,xyz,timer
1,1636674330000,cde,counter
41,1636674330000,cde,timer
1,1636674330000,fgh,counter
231,1636674330000,xyz,timer
1,1636674330000,abc,counter
2431,1636674330000,cde,counter
42,1636674330000,efg,timer
Probably the simplest method is to create a View:
CREATE VIEW foo AS
SELECT
val,
cast(from_unixtime(time__stamp / 1000) as timestamp) as timestamp,
cast(from_unixtime(time__stamp / 1000) as date) as date,
hour(cast(from_unixtime(time__stamp / 1000) as timestamp)) as hour,
name,
type
FROM airflow_cluster_data
WHERE name not in ('abc','cdf','fgh')
AND name not like '%operator%'
AND type != 'counter'
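Queries can then filter on the derived date and hour columns of the view (an illustrative sketch; the literal values are placeholders):
-- placeholder date/hour values for illustration
SELECT name, type, val
FROM foo
WHERE "date" = DATE '2021-11-11'
AND "hour" = 23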
You can also create your own UDF for the transformation and use it in Athena: https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html
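For reference, an Athena UDF is invoked with the USING EXTERNAL FUNCTION syntax (a sketch; the function name normalize_name and the Lambda function name my-udf-lambda are hypothetical):
-- normalize_name and 'my-udf-lambda' are hypothetical names
USING EXTERNAL FUNCTION normalize_name(name VARCHAR)
RETURNS VARCHAR
LAMBDA 'my-udf-lambda'
SELECT normalize_name(name), val, type
FROM airflow_cluster_data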

ignore malformed JSON from AWS Athena query

I am trying to get the data for Bogota from OpenAQ in AWS Athena, using the following query.
SELECT *
FROM openaq
WHERE city='Bogota'
I get the following error, referring to malformed JSON:
Row is not a valid JSON Object - JSONException: Unterminated string at 201 [character 202 line 1]
Is there a way to ignore the rows with malformed JSONs in the query?
I have tried adjusting the table template from this source using this line of code as per this thread.
'ignore.malformed.json'='true'
So the new table template query reads:
CREATE EXTERNAL TABLE `openaq`(
`date` struct<utc:string,local:string> COMMENT 'from deserializer',
`parameter` string COMMENT 'from deserializer',
`location` string COMMENT 'from deserializer',
`value` float COMMENT 'from deserializer',
`unit` string COMMENT 'from deserializer',
`city` string COMMENT 'from deserializer',
`attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer',
`averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer',
`coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer',
`country` string COMMENT 'from deserializer',
`sourcename` string COMMENT 'from deserializer',
`sourcetype` string COMMENT 'from deserializer',
`mobile` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json'='true')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://openaq-fetches/realtime-gzipped'
TBLPROPERTIES (
'transient_lastDdlTime'='1518373755')
This seems to solve the error for me.

Athena query returning zero results

Hi, I am creating a table like this:
CREATE EXTERNAL TABLE `historyrecordjson`(
`last_name` string COMMENT 'from deserializer',
`first_name` string COMMENT 'from deserializer',
`email` string COMMENT 'from deserializer',
`country` string COMMENT 'from deserializer',
`city` string COMMENT 'from deserializer',
`event_time` bigint COMMENT 'from deserializer'
)
PARTITIONED BY (
`account_id` string,
`year` string,
`month` string,
`day` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://aguptahistoryrecordcopy/recordshistoryjson/'
TBLPROPERTIES (
'projection.account_id.type'='injected',
'projection.day.range'='01,31',
'projection.day.type'='integer',
'projection.enabled'='true',
'projection.month.range'='01,12',
'projection.month.type'='integer',
'projection.year.range'='2020,3000',
'projection.year.type'='integer',
'storage.location.template'='s3://aguptahistoryrecordcopy/historyrecordjson/${account_id}/${year}/${month}/${day}')
When I run the query below, it returns zero records:
SELECT * FROM "historyrecordjson" where account_id='acc-1234' AND year= '2021' AND month= '1' AND day='1' limit 10 ;
My S3 directory looks like:
s3://aguptahistoryrecordcopy/historyrecordjson/account_id=acc-1234/year=2021/month=1/day=1/1b339139-326c-432f-90aa-15bf30f37be2.json
I can see that the partition is getting loaded as:
account_id=acc-1234/year=2021/month=1/day=1
I am not sure what I am missing. I see in the query result that Data scanned: 0 KB.
The DDL that you are using is for a text-delimited file, whereas your actual data in S3 is JSON. Refer to https://github.com/rcongiu/Hive-JSON-Serde and create the table with the correct SerDe and definition for JSON data, as sketched below.
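For example, a minimal sketch of the same table redefined with the OpenX JSON SerDe (assuming the JSON field names match the column names; the partition projection properties are carried over unchanged from the question):
-- sketch: same columns and projection settings, but with the JSON SerDe
CREATE EXTERNAL TABLE `historyrecordjson`(
`last_name` string,
`first_name` string,
`email` string,
`country` string,
`city` string,
`event_time` bigint)
PARTITIONED BY (
`account_id` string,
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://aguptahistoryrecordcopy/recordshistoryjson/'
TBLPROPERTIES (
'projection.account_id.type'='injected',
'projection.day.range'='01,31',
'projection.day.type'='integer',
'projection.enabled'='true',
'projection.month.range'='01,12',
'projection.month.type'='integer',
'projection.year.range'='2020,3000',
'projection.year.type'='integer',
'storage.location.template'='s3://aguptahistoryrecordcopy/historyrecordjson/${account_id}/${year}/${month}/${day}')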

create bucketed table in AWS Athena

I tried the query below to create a bucketed table but it failed. However, if I remove the CLUSTERED BY clause, the query succeeds. Any suggestions? Thank you.
The error message: no viable alternative at input create external
CREATE EXTERNAL TABLE nation5(
n_nationkey bigint,
n_name string,
n_rgionkey int,
n_comment string)
CLUSTERED BY
n_regionkey INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='s3://test/testbucket/nation5/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test/testbucket/nation5/'
The CLUSTERED BY column needs to be in parentheses; the following works:
CREATE EXTERNAL TABLE test.nation5(
n_nationkey bigint,
n_name string,
n_regionkey int,
n_comment string)
CLUSTERED BY
(n_regionkey) INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='s3://test/testbucket/nation5/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test/testbucket/nation5/'
(You also have a spelling mistake, n_rgionkey, in the column definition.)

Athena Issue - query string in lowercase

I have a table that contains a JSON column, column_A.
Instead of setting column_A to a struct, I set column_A as a string so I can query the JSON.
The problem is that when I query column_A I receive the data in lowercase.
CREATE EXTERNAL TABLE `table_test`(
`userid` string COMMENT 'from deserializer',
`column_a` string COMMENT 'from deserializer',
`createdat` string COMMENT 'from deserializer',
`updatedat` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='column_a,userId')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test-stackoverflow/'
TBLPROPERTIES (
'classification'='json',
'transient_lastDdlTime'='1567353697')
I understand that this is related to the SerDe, but I don't know what I should change it to.
What can I do to solve it?
The reasons I don't set column_A as a struct are:
1. The keys change every time, and as far as I know I need to set the key names when I define the struct.
2. I have an empty string as a key - I got errors querying a struct with "" as a key.
Thanks.
Found the solution.
The problem is with the SerDe:
https://docs.aws.amazon.com/athena/latest/ug/json.html
You need to add 'case.insensitive'= "FALSE" to the SerDe properties.
CREATE EXTERNAL TABLE `table_test`(
`userid` string COMMENT 'from deserializer',
`column_a` string COMMENT 'from deserializer',
`createdat` string COMMENT 'from deserializer',
`updatedat` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'case.insensitive'= "FALSE",
'paths'='column_a,userId')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test-stackoverflow/'
TBLPROPERTIES (
'classification'='json',
'transient_lastDdlTime'='1567353697')
Hope this will help someone.
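With the keys preserved in their original case, the JSON string can then be queried directly with Athena's JSON functions (an illustrative sketch; the key name MyKey is hypothetical):
-- '$.MyKey' is a hypothetical JSON key in column_a
SELECT userid,
json_extract_scalar(column_a, '$.MyKey') AS my_key
FROM table_test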