Amazon Athena - Creating external table and ignoring commas enclosed in quotes - amazon-web-services

I'm trying to create an table in Athena via the AWS CLI. My file has string fields enclosed in quotes. my understanding is that I need to set the serdeproperties to take care of this. The below command doesn't work and I assume it has something to do with the way the way I handled escaping the quote in serdeproperties.
Doesn't work:
aws athena start-query-execution
--query-string "
create external table tbl1 (val1 INT, val2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '\"', 'escapeChar' = '\\')
LOCATION 's3://bucketpath/folder'
TBLPROPERTIES ('skip.header.line.count'='1');"
--result-configuration "OutputLocation=s3://bucketpath/Output/"
Works:
aws athena start-query-execution
--query-string "
create external table tbl1 (val1 INT, val2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketpath/folder'
TBLPROPERTIES ('skip.header.line.count'='1');"
--result-configuration "OutputLocation=s3://bucketpath/Output/"
Any thoughts?

Related

AWS Athena identify malformed CSVs

I am trying to combine CSVs which I created and put into S3 into a single CSV file. I am using AWS Athena queries to do that, but I'm getting errors like
HIVE_BAD_DATA: Error parsing column '0': For input string: apples</li><li>Item 1</li><li>Item 2</li><div class=""atlanta"">organges</div> ...
How do I solve this?
This is my code - the description field/column contains HTML.
Creating table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
`item_no` double,
`description` string,
`name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Querying table:
select * from my_table
where "item_no" < 30000
limit 10

Create automatic partition for S3 year/month/day/hour folders

My S3 bucket has the following structure.
's3://st.pix/year/month/day/hour' for example
's3://st.pix/2022/09/01/06'
So I tried to create a partition table on this bucket using this code:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
`ad_id` string,
)
PARTITIONED BY (year string, month string, day string , hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://st.pix/${year}/${month}/${day}/${hour}/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');
and right after run
MSCK REPAIR TABLE Accesn
But unfortunately, this query gets no result.
SELECT count(*) FROM `acco`.`Accesn` where year ='2022' and month= '03' and day ='01' and hour ='01'
Can I use ${year}/${month}/${day}/${hour}/ in my LOCATION ?
If no, what are the options to do it dynamically and not using ALTER TABLE .. ADD PARTITION for a specific partition.

AWS Athena null values are replaced by N after table is created. How to keep them as it is?

I'm creating a table in Athena from csv data in S3. The data has some columns quoted, so I use:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
'serialization.null.format' = '')
The serde works fine but then the null values in the resultant table are replaced with N. How can I keep the null values as empty or like Null etc, but not as N.
Thanks.

Remove double quotes " while loading data to Amazon Redshift Spectrum

I want to load data to amazon redshift external table. Data is in CSV format and has quotes.
Do we have something like REMOVEQUOTES which we have in copy command for redshift external
tables. Also what are different options to load fixed length data in external table.
To create an external Spectrum table, you should reference the CREATE TABLE syntax provided by Athena. To load a CSV escaped by double quotes, you should use the following lines as your ROW FORMAT
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
For fixed length files, you should use the RegexSerDe. In this case, the relevant portion of your CREATE TABLE statement will look like this (assuming 3 fields of length 100).
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{100})(.{100})(.{100})")
You can also use regex to parse data enclosed by multiple characters. Example (in CSV file, fields were surrounded by triple double-quotes (""")):
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.AbstractSerDe'
WITH SERDEPROPERTIES (
'input.regex' = "^\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*$" )
)

Amazon Athena: Split line separated by |

I have log files where each line has the format:
key1=val1|key2=val2|key3=val3
How do I make Amazon Athena split this into columns key1, key2 and key3?
You can create a table based on Regex. This way you can define the parsing scheme for your table.
For you sample, the DDL would look like this.
CREATE EXTERNAL TABLE IF NOT EXISTS test (
key1 string,
key2 string,
key3 string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^key1=([^\\|]+)\\|key2=([^\\|]+)\\|key3=([^\\|]+)$"
) LOCATION 's3://njams-data/test/';