Create external table from CSV file in AWS Athena

I am trying to create an external table in AWS Athena from a csv file that is stored in my S3.
The csv file looks as follows. As you can see, the data is not enclosed in quotation marks (") and is delimited by commas (,).
ID,PERSON_ID,DATECOL,GMAT
612766604,54723367,2020-01-15,637
615921503,158634997,2020-01-25,607
610656030,90359154,2020-01-07,670
I tried the following code to create a table:
CREATE EXTERNAL TABLE my_table
(
ID string,
PERSON_ID int,
DATE_COL date,
GMAT int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES
(
'skip.header.line.count'='1'
)
;
I tried to preview the table with the following code:
select
*
from
my_table
limit 10
Which raises this error:
HIVE_BAD_DATA: Error parsing field value '2020-01-15' for field 2: For input string: "2020-01-15"
My question is: Am I passing the correct serde? And if so, how can I format the date column (DATE_COL) so that it reads and displays dates as YYYY-MM-DD?

I replaced ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with
FIELDS TERMINATED BY ',' and enclosed the column names with "`". The following code creates the table correctly:
CREATE EXTERNAL TABLE my_table
(
`ID` string,
`PERSON_ID` int,
`DATE_COL` date,
`GMAT` int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
I do not understand the concept of a serde, but I suppose I did not need one to begin with.

Per the documentation, when the OpenCSVSerde is used, a column of type DATE must contain values representing the number of days since January 1, 1970. For example, the date in the first row after your header should have the value 18276. When the table is queried, that value is rendered as 2020-01-15.
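To make the arithmetic concrete, here is a minimal Python sketch that converts the ISO dates from the sample file into day counts since January 1, 1970. The helper name days_since_epoch is hypothetical and only for illustration; it is not part of Athena or the serde.

```python
from datetime import date

# Day zero for the numeric DATE representation described above.
UNIX_EPOCH = date(1970, 1, 1)


def days_since_epoch(iso_date: str) -> int:
    """Return the number of days between 1970-01-01 and an ISO date (YYYY-MM-DD)."""
    return (date.fromisoformat(iso_date) - UNIX_EPOCH).days


# The three DATECOL values from the sample CSV in the question.
csv_dates = ["2020-01-15", "2020-01-25", "2020-01-07"]
epoch_days = [days_since_epoch(d) for d in csv_dates]

for original, converted in zip(csv_dates, epoch_days):
    print(original, "->", converted)
```

The first value comes out as 18276, which matches the figure quoted above. Rewriting the file this way is only one option; the answer above, which drops the serde, avoids the conversion entirely.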

Related

AWS Athena: Partition projection using date-hour with mixed ranges

I am trying to create an Athena table using partition projection. I am delivering records to S3 using Kinesis Firehose, grouped using a dynamic partitioning key. For example, the records look like the following:
period           item_id
2022/05          monthly_item_1
2022/05/04       daily_item_1
2022/05/04/02    hourly_item_1
2022/06          monthly_item_2
I want to partition the data in S3 by period, which can be monthly, daily or hourly. It is guaranteed that period would be in a supported Java date format. Therefore, I am writing these records to S3 in the below format:
s3://bucket/prefix/2022/05/monthly_items.gz
s3://bucket/prefix/2022/05/04/daily_items.gz
s3://bucket/prefix/2022/05/04/02/hourly_items.gz
s3://bucket/prefix/2022/06/monthly_items.gz
I want to run Athena queries for every partition scope, i.e. if my query is for a specific day, I want to fetch its daily_items and hourly_items. If I am running a query for a month, I want to fetch its monthly, daily, and hourly items.
I've created an Athena table using the query below:
create external table `my_table`(
`period` string COMMENT 'from deserializer',
`item_id` string COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.day.type'='integer',
'projection.day.digits' = '2',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.digits' = '2',
'projection.hour.range'='00,23',
'projection.month.type'='integer',
'projection.month.digits'='02',
'projection.month.range'='01,12',
'projection.year.format'='yyyy',
'projection.year.range'='2022,NOW',
'projection.year.type'='date',
'storage.location.template'='s3://bucket/prefix/${year}/${month}/${day}/${hour}')
However, running the query below against this table returns zero results:
select * from my_table where year = '2022' and month = '06';
I believe the reason is Athena expects all files to be present under the same prefix as defined by storage.location.template. Therefore, any records present under a month or day prefix are not projected.
I was wondering if it was possible to support such querying functionality in a single table with partition projection enabled, when data in S3 is in a folder type structure similar to the examples above.
Would be great if anyone can help me out!
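The hypothesis in the question can be illustrated with a rough Python model of how the location template expands. This is a simplified sketch, not Athena's actual implementation: it assumes each projected partition maps to exactly one prefix built from storage.location.template and that only objects under that prefix are read. The function names projected_locations and visible_objects are hypothetical.

```python
from itertools import product

# Python-format version of the storage.location.template from the table definition.
TEMPLATE = "s3://bucket/prefix/{year}/{month}/{day}/{hour}"

# The four objects listed in the question.
objects = [
    "s3://bucket/prefix/2022/05/monthly_items.gz",
    "s3://bucket/prefix/2022/05/04/daily_items.gz",
    "s3://bucket/prefix/2022/05/04/02/hourly_items.gz",
    "s3://bucket/prefix/2022/06/monthly_items.gz",
]


def projected_locations(year: str, month: str) -> list[str]:
    """Expand the template over the day (01-31) and hour (00-23) ranges."""
    days = [f"{d:02d}" for d in range(1, 32)]
    hours = [f"{h:02d}" for h in range(0, 24)]
    return [
        TEMPLATE.format(year=year, month=month, day=d, hour=h) + "/"
        for d, h in product(days, hours)
    ]


def visible_objects(year: str, month: str) -> list[str]:
    """Return the objects that sit under at least one projected location."""
    locations = projected_locations(year, month)
    return [key for key in objects if any(key.startswith(loc) for loc in locations)]


print(len(projected_locations("2022", "06")))  # number of day/hour prefixes
print(visible_objects("2022", "06"))
print(visible_objects("2022", "05"))
```

Under these assumptions, every projected location has the full year/month/day/hour depth, so the monthly and daily files never fall under any of them. Only the hourly file for 2022/05/04/02 is reachable, and the query for month 06 sees nothing, which matches the observed result.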

AWS Athena identify malformed CSVs

I am trying to combine CSVs which I created and put into S3 into a single CSV file. I am using AWS Athena queries to do that, but I'm getting errors like
HIVE_BAD_DATA: Error parsing column '0': For input string: apples</li><li>Item 1</li><li>Item 2</li><div class=""atlanta"">organges</div> ...
How do I solve this?
This is my code - the description field/column contains HTML.
Creating table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
`item_no` double,
`description` string,
`name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Querying table:
select * from my_table
where "item_no" < 30000
limit 10

AWS Athena null values are replaced by N after table is created. How to keep them as it is?

I'm creating a table in Athena from csv data in S3. The data has some columns quoted, so I use:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
'serialization.null.format' = '')
The serde works fine, but the null values in the resulting table are replaced with N. How can I keep the null values as empty or as NULL, rather than as N?
Thanks.

Type conversion on the fly

I have a CSV file (with AWS billing info) where each field is stored as a string in quotes, like this: "value"
So part of the sample line looks as follows:
"234234324223532","First 3 Dashboards per month are free.","2018-08-01 00:00:00","2018-08-01 01:00:00","0.0026881720"
When I define a new table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
Id INT,
Desc STRING,
StartTime TIMESTAMP,
EndTime TIMESTAMP,
Cost DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://MYBUCKET/FOLDER/'
I can only see Desc values in results of Select * from mydb.mytable
Is it possible to define some converters in create table statement?
Or do I need to remove most of the quotation marks (") from source files? That's very undesirable.
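The problem is that plain delimited parsing does not strip the quotation marks, so every field arrives as a quoted string and cannot be converted to INT, TIMESTAMP, or DOUBLE. If you define all columns as string, you should be able to see all the content.
You can try a SerDe that lets you define the quote character, so that the unquoted values can be converted to the declared types: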
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
Id INT,
Desc STRING,
StartTime TIMESTAMP,
EndTime TIMESTAMP,
Cost DOUBLE
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'=',')
LOCATION 's3://MYBUCKET/FOLDER/'
I hope this helps.
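The difference between plain comma splitting and quote-aware parsing can be reproduced locally. The following Python sketch uses the standard csv module as a stand-in for a quote-aware reader; it only illustrates the parsing behavior and is not how Athena itself is implemented. The helper can_parse_as_int is hypothetical.

```python
import csv

# The sample line from the question.
line = (
    '"234234324223532","First 3 Dashboards per month are free.",'
    '"2018-08-01 00:00:00","2018-08-01 01:00:00","0.0026881720"'
)

# Plain delimiter splitting keeps the quotation marks as part of each field.
naive_fields = line.split(",")

# Quote-aware parsing strips the quote character from each field.
quoted_fields = next(csv.reader([line], delimiter=",", quotechar='"'))


def can_parse_as_int(text: str) -> bool:
    """Return True if the text converts cleanly to an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


print(naive_fields[0], can_parse_as_int(naive_fields[0]))
print(quoted_fields[0], can_parse_as_int(quoted_fields[0]))
```

With plain splitting, the first field still carries its quotation marks and fails integer conversion, while the string column is unaffected. That is consistent with only Desc showing values in the original table.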

Amazon Athena: Split line separated by |

I have log files where each line has the format:
key1=val1|key2=val2|key3=val3
How do I make Amazon Athena split this into columns key1, key2 and key3?
You can create a table based on a regular expression. This way you can define the parsing scheme for your table.
For your sample, the DDL would look like this:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
key1 string,
key2 string,
key3 string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^key1=([^\\|]+)\\|key2=([^\\|]+)\\|key3=([^\\|]+)$"
) LOCATION 's3://njams-data/test/';
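Before creating the table, it can help to check the pattern against a sample line. The sketch below uses Python's re module, which handles this particular pattern the same way, although the serde itself uses Java regular expressions. Note that the doubled backslashes in the DDL become single backslashes once the string literal is unescaped; the helper name split_line is hypothetical.

```python
import re

# Same pattern as in the DDL, written as a Python raw string.
PATTERN = re.compile(r"^key1=([^\|]+)\|key2=([^\|]+)\|key3=([^\|]+)$")


def split_line(line: str):
    """Return the three captured values, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None


print(split_line("key1=val1|key2=val2|key3=val3"))
print(split_line("key1=val1|key2=val2"))
```

Each capture group maps to one column in declaration order, so the first group feeds key1, the second key2, and the third key3. The pattern hard-codes the key names and their order, so lines with missing or reordered keys will not match.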