Type conversion on the fly - amazon-athena

I've got a CSV file (with AWS billing info) where each field is stored as a string in quotes, like this: "value"
So part of the sample line looks as follows:
"234234324223532","First 3 Dashboards per month are free.","2018-08-01 00:00:00","2018-08-01 01:00:00","0.0026881720"
When I define new table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
  Id INT,
  Desc STRING,
  StartTime TIMESTAMP,
  EndTime TIMESTAMP,
  Cost DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://MYBUCKET/FOLDER/'
I can only see the Desc values in the results of SELECT * FROM mydb.mytable.
Is it possible to define some converters in the CREATE TABLE statement?
Or do I need to remove most of the quotation marks (") from source files? That's very undesirable.

The problem you are having is that Athena is treating all of the quoted content as strings. If you define all columns as string you should be able to see all the content.
You can also try using a SerDe that lets you define the quote character, so the declared data types can be applied:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
  Id INT,
  Desc STRING,
  StartTime TIMESTAMP,
  EndTime TIMESTAMP,
  Cost DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'quoteChar' = '\"',
  'separatorChar' = ','
)
LOCATION 's3://MYBUCKET/FOLDER/'
I hope this helps.
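A caveat: OpenCSVSerde is strict about temporal types (it expects TIMESTAMP and DATE as UNIX numeric values), so if the typed table still misbehaves, a fallback is to declare every column as string and convert on the fly in the query instead. A sketch, with a made-up mytable_raw table name (Id is cast to BIGINT because the sample value exceeds the INT range, and Desc is a reserved word, hence the quoting):

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable_raw (
  `Id` STRING,
  `Desc` STRING,
  `StartTime` STRING,
  `EndTime` STRING,
  `Cost` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'quoteChar' = '\"',
  'separatorChar' = ','
)
LOCATION 's3://MYBUCKET/FOLDER/';

-- Convert on the fly when querying:
SELECT CAST(Id AS BIGINT)           AS Id,
       "Desc"                       AS Description,
       CAST(StartTime AS TIMESTAMP) AS StartTime,
       CAST(EndTime AS TIMESTAMP)   AS EndTime,
       CAST(Cost AS DOUBLE)         AS Cost
FROM mydb.mytable_raw;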

Related

AWS Athena identify malformed CSVs

I am trying to combine CSVs which I created and put into S3 into a single CSV file. I am using AWS Athena queries to do that, but I'm getting errors like
HIVE_BAD_DATA: Error parsing column '0': For input string: apples</li><li>Item 1</li><li>Item 2</li><div class=""atlanta"">organges</div> ...
How do I solve this?
This is my code - the description field/column contains HTML.
Creating table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
  `item_no` double,
  `description` string,
  `name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Querying table:
select * from my_table
where "item_no" < 30000
limit 10

How to transform data in Amazon Athena

I have some data in an S3 location in JSON format. It has 4 columns: val, time__stamp, name and type. I would like to create an external Athena table from this data with the transformations given below:
time__stamp: the timestamp should be converted from Unix epoch to UTC; this I did by using the timestamp data type.
name: name should be filtered with the following SQL logic:
name not in ('abc','cdf','fgh') and name not like '%operator%'
type: type should not have values labeled as counter.
I would also like to add two partition columns, date and hour, which should be derived from the time__stamp column.
I started with following:
CREATE EXTERNAL TABLE `airflow_cluster_data`(
  `val` string COMMENT 'from deserializer',
  `time__stamp` timestamp COMMENT 'from deserializer',
  `name` string COMMENT 'from deserializer',
  `type` string COMMENT 'from deserializer')
PARTITIONED BY (
  date,
  hour)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'mapping.time_stamp'='#timestamp')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket1/raw/airflow_data'
I tried various things but couldn't figure out the syntax. Using Spark would have been easier, but I don't want to run Amazon EMR every hour for a small data set. I would prefer to do it in Athena if possible.
Please have a look at some sample data:
1533,1636674330000,abc,counter
1533,1636674330000,xyz,timer
1,1636674330000,cde,counter
41,1636674330000,cde,timer
1,1636674330000,fgh,counter
231,1636674330000,xyz,timer
1,1636674330000,abc,counter
2431,1636674330000,cde,counter
42,1636674330000,efg,timer
Probably the simplest method is to create a View:
CREATE VIEW foo AS
SELECT
  val,
  cast(from_unixtime(time__stamp / 1000) as timestamp) as timestamp,
  cast(from_unixtime(time__stamp / 1000) as date) as date,
  hour(cast(from_unixtime(time__stamp / 1000) as timestamp)) as hour,
  name,
  type
FROM airflow_cluster_data
WHERE name not in ('abc','cdf','fgh')
  AND name not like '%operator%'
  AND type != 'counter'
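The view can then be queried like any table. Note that date and hour here are ordinary view columns, not real partitions, so Athena still scans the full data set. For the sample epoch 1636674330000 (2021-11-11 23:45:30 UTC), a query would look like:

SELECT val, name, type
FROM foo
WHERE "date" = DATE '2021-11-11'
  AND "hour" = 23;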
You can also create your own UDF for the transformation and use it in Athena: https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html
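Athena UDFs are implemented as AWS Lambda functions and invoked with the USING EXTERNAL FUNCTION clause. A sketch, assuming a hypothetical Lambda named my-athena-udf that implements a clean_name handler:

USING EXTERNAL FUNCTION clean_name(name VARCHAR)
RETURNS VARCHAR
LAMBDA 'my-athena-udf' -- hypothetical Lambda function name
SELECT clean_name(name)
FROM airflow_cluster_data;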

Create external table from csv file in AWS Athena

I am trying to create an external table in AWS Athena from a csv file that is stored in my S3.
The csv file looks as follows. As you can see, the data is not enclosed in quotation marks (") and is delimited by commas (,).
ID,PERSON_ID,DATECOL,GMAT
612766604,54723367,2020-01-15,637
615921503,158634997,2020-01-25,607
610656030,90359154,2020-01-07,670
I tried the following code to create a table:
CREATE EXTERNAL TABLE my_table (
  ID string,
  PERSON_ID int,
  DATE_COL date,
  GMAT int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES ('skip.header.line.count'='1');
I tried to preview the table with the following code:
select *
from my_table
limit 10
Which raises this error:
HIVE_BAD_DATA: Error parsing field value '2020-01-15' for field 2: For input string: "2020-01-15"
My question is: Am I passing the correct serde? And if so, how can I format the date column (DATE_COL) such that it reads and displays days in YYYY-MM-DD?
I replaced ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' and enclosed the column names in backticks (`). The following code creates the table correctly:
CREATE EXTERNAL TABLE my_table (
  `ID` string,
  `PERSON_ID` int,
  `DATE_COL` date,
  `GMAT` int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES ('skip.header.line.count'='1');
I do not understand the concept of a serde, but I suppose I did not need one to begin with.
Per the documentation, with the OpenCSVSerde a column of type DATE must contain values representing the number of days since January 1, 1970. For example, the date on row 1 after your header would have to be stored as 18276; when the table is queried, that value is then rendered as 2020-01-15.
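If you do want to keep the OpenCSVSerde (for example, because other files in the bucket are quoted), a workaround, sketched here rather than taken from the answers above, is to declare DATE_COL as string and cast at query time; Athena parses ISO strings like '2020-01-15' directly:

CREATE EXTERNAL TABLE my_table (
  ID string,
  PERSON_ID int,
  DATE_COL string, -- kept as string; OpenCSVSerde expects DATE as epoch days
  GMAT int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES ('skip.header.line.count'='1');

SELECT ID, PERSON_ID, CAST(DATE_COL AS date) AS DATE_COL, GMAT
FROM my_table
LIMIT 10;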

Remove double quotes " while loading data to Amazon Redshift Spectrum

I want to load data into an Amazon Redshift external table. The data is in CSV format and has quotes. Is there something like the REMOVEQUOTES option that the COPY command offers, but for Redshift external tables? Also, what are the options for loading fixed-length data into an external table?
To create an external Spectrum table, you should reference the CREATE EXTERNAL TABLE syntax provided by Athena. To load a CSV whose fields are enclosed in double quotes, use the following lines as your ROW FORMAT:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
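Put together, a complete definition might look like the sketch below; the schema, table, and column names and the bucket are placeholders, and the external schema must already exist (created with CREATE EXTERNAL SCHEMA):

CREATE EXTERNAL TABLE spectrum_schema.quoted_csv (
  col1 varchar(100),
  col2 varchar(100),
  col3 varchar(100)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://my-bucket/quoted-csv/';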
For fixed length files, you should use the RegexSerDe. In this case, the relevant portion of your CREATE TABLE statement will look like this (assuming 3 fields of length 100).
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{100})(.{100})(.{100})")
You can also use a regex to parse data enclosed by multiple characters. For example, in a CSV file where fields were surrounded by triple double quotes ("""):
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = "^\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*$"
)

Amazon Athena: Split line separated by |

I have log files where each line has the format:
key1=val1|key2=val2|key3=val3
How do I make Amazon Athena split this into columns key1, key2 and key3?
You can create a table based on a regex. This way you can define the parsing scheme for your table.
For your sample, the DDL would look like this:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
  key1 string,
  key2 string,
  key3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^key1=([^\\|]+)\\|key2=([^\\|]+)\\|key3=([^\\|]+)$"
)
LOCATION 's3://njams-data/test/';
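Each capture group in input.regex maps positionally to a column, so for the sample line key1=val1|key2=val2|key3=val3 a quick check would return:

SELECT key1, key2, key3
FROM test
LIMIT 10;

-- key1 | key2 | key3
-- val1 | val2 | val3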