AWS Athena: identify malformed CSVs

I am trying to combine CSV files that I created and uploaded to S3 into a single CSV file, using AWS Athena queries. However, I'm getting errors like:
HIVE_BAD_DATA: Error parsing column '0': For input string: apples</li><li>Item 1</li><li>Item 2</li><div class=""atlanta"">organges</div> ...
How do I solve this?
This is my code; the description field/column contains HTML.
Creating the table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
  `item_no` double,
  `description` string,
  `name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Querying the table:
select * from my_table
where "item_no" < 30000
limit 10
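One way to locate the offending rows (a hedged sketch, not a definitive fix): OpenCSVSerde raises HIVE_BAD_DATA when a value cannot be parsed into the declared column type, so declare every column as string, then use Athena's "$path" pseudo-column and TRY_CAST to find the files and rows where item_no is not numeric. The table my_table_raw is a hypothetical debugging table, not part of the original setup.
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table_raw` (
  `item_no` string,
  `description` string,
  `name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
-- Rows whose first field is not numeric point at the malformed files
SELECT "$path" AS source_file, item_no, description
FROM my_table_raw
WHERE TRY_CAST(item_no AS double) IS NULL;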

Related

Athena query not populating data correctly

I am new to Athena and wanted to create a table from a CSV file in an S3 bucket with the following query:
CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`testdata` (
  `col1` varchar(255),
  `col2` tinyint,
  `col3` varchar(255)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'LINES TERMINATED BY' = '\n',
  'ESCAPED BY' = '\\',
  'quoteChar' = '\"'
)
LOCATION 's3://athenaquery/data/'
TBLPROPERTIES ('skip.header.line.count'="1");
It successfully created the table, but col3 contains multi-line values; when I run a SELECT statement, the second line of a col3 value ends up in column 1 of the next row.
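A note on the likely cause, hedged: LazySimpleSerDe has no notion of quoting, so properties like 'quoteChar' are ignored and an embedded newline simply starts a new record. OpenCSVSerde (sketch below, with all columns declared as string since that SerDe operates on strings; testdata_csv is a made-up table name) at least honors quoted fields, but Athena still does not support line breaks inside CSV fields, so multi-line values generally need to be cleaned or re-exported upstream.
CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`testdata_csv` (
  `col1` string,
  `col2` string,
  `col3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://athenaquery/data/'
TBLPROPERTIES ('skip.header.line.count'='1');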

Create automatic partition for S3 year/month/day/hour folders

My S3 bucket has the following structure:
's3://st.pix/year/month/day/hour', for example
's3://st.pix/2022/09/01/06'
So I tried to create a partition table on this bucket using this code:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
  `ad_id` string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://st.pix/${year}/${month}/${day}/${hour}/'
TBLPROPERTIES ('has_encrypted_data'='false', 'compressionType'='gzip');
and ran the following right after:
MSCK REPAIR TABLE Accesn
But unfortunately, this query returns no results:
SELECT count(*) FROM `acco`.`Accesn` where year ='2022' and month= '03' and day ='01' and hour ='01'
Can I use ${year}/${month}/${day}/${hour}/ in my LOCATION?
If not, what are my options for doing this dynamically, without using ALTER TABLE .. ADD PARTITION for each specific partition?
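One answer, hedged: a plain LOCATION does not expand ${...} placeholders, and MSCK REPAIR TABLE only discovers partitions laid out in Hive style (.../year=2022/month=09/...), so it finds nothing under this layout. Athena's partition projection does accept such placeholders via its storage.location.template property and removes the need for ALTER TABLE .. ADD PARTITION entirely. A minimal sketch; the projection ranges are assumptions to adjust:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
  `ad_id` string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://st.pix/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2015,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://st.pix/${year}/${month}/${day}/${hour}/',
  'has_encrypted_data' = 'false',
  'compressionType' = 'gzip'
);
With this in place, the SELECT above should prune to the single 2022/03/01/01 prefix without any repair step.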

AWS Redshift Spectrum (external table) - a varchar column cannot handle both array and string data for the same column

Task: Trying to load a set of JSON files from S3 buckets into Redshift using Redshift Spectrum.
Problem: JSON objects in a few files have their data wrapped in square brackets, while other JSON files have the same objects without square brackets. Is there a way to consume data both with and without square brackets "[ ]" when creating an external table using Redshift Spectrum?
JSON FILES to be consumed:
File 1: {"x":{"y":{"z":["ABCD"]}}}
File 2: {"x":{"y":{"z":"EFGH"}}}
Case 1
When column z is defined as an array, I miss the data from the JSON files that are without square brackets:
CREATE EXTERNAL TABLE spectrum.table
(x struct<y:struct<z:array<varchar(256)>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('dots.in.keys' = 'true')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://****'
Query: SELECT c FROM spectrum.table t, t.x.y.z c;
Case 2
When column z is defined as varchar (without declaring it as an array), below is the error:
Create Statement:
CREATE EXTERNAL TABLE spectrum.table
(x struct<y:struct<z:varchar(256)>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('dots.in.keys' = 'true')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://****'
Query: SELECT regexp_replace(t.x.y.z, '\\([\\"])', '') FROM spectrum.table t;
or: SELECT t.x.y.z FROM spectrum.table t;
[XX000][500310] [Amazon](500310) Invalid operation: Spectrum Scan Error
Details:
-----------------------------------------------
error: Spectrum Scan Error
code: 15001
context: Unsupported implicit cast: Column ('x' 'y' 'z'), From Type: LIST, To Type: VARCHAR,
----------------------------------------------

Remove double quotes " while loading data to Amazon Redshift Spectrum

I want to load data into an Amazon Redshift external table. The data is in CSV format and contains quotes. Is there something like the REMOVEQUOTES option of the COPY command for Redshift external tables? Also, what are the options for loading fixed-length data into an external table?
To create an external Spectrum table, you should reference the CREATE TABLE syntax used by Athena. To load a CSV whose fields are enclosed in double quotes, use the following lines as your ROW FORMAT:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
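Put together, a complete statement might look like the following sketch (spectrum.quoted_csv, its columns, and the location are made-up names):
CREATE EXTERNAL TABLE spectrum.quoted_csv (
  id varchar(32),
  name varchar(256)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://my-bucket/quoted-csv/';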
For fixed-length files, you should use the RegexSerDe. In this case, the relevant portion of your CREATE TABLE statement will look like this (assuming three fields of length 100):
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{100})(.{100})(.{100})")
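Again as a complete sketch, with hypothetical table, column, and location names:
CREATE EXTERNAL TABLE spectrum.fixed_width (
  field1 varchar(100),
  field2 varchar(100),
  field3 varchar(100)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{100})(.{100})(.{100})")
STORED AS TEXTFILE
LOCATION 's3://my-bucket/fixed-width/';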
You can also use a regex to parse data enclosed by multiple characters. For example, in a CSV file where each field was surrounded by triple double quotes ("""):
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = "^\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*$"
)

Amazon Athena: Split line separated by |

I have log files where each line has the format:
key1=val1|key2=val2|key3=val3
How do I make Amazon Athena split this into columns key1, key2 and key3?
You can create a table based on a regex; this way you define the parsing scheme for your table.
For your sample, the DDL would look like this:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
  key1 string,
  key2 string,
  key3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^key1=([^\\|]+)\\|key2=([^\\|]+)\\|key3=([^\\|]+)$"
)
LOCATION 's3://njams-data/test/';
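Against the sample line from the question, a query should then split the values into columns (expected output shown as comments):
SELECT key1, key2, key3 FROM test;
-- key1 | key2 | key3
-- val1 | val2 | val3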