I want to load data into an Amazon Redshift external table. The data is in CSV format and contains quotes.
Is there something like the REMOVEQUOTES option of the COPY command for Redshift external
tables? Also, what are the options for loading fixed-length data into an external table?
To create a Redshift Spectrum external table, follow the CREATE EXTERNAL TABLE syntax, which is the same DDL Athena uses. To load a CSV whose fields are enclosed in double quotes, use the following lines as your ROW FORMAT:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
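For context, here is a minimal end-to-end sketch of a Spectrum DDL using that ROW FORMAT; the schema, table, and column names and the S3 path are placeholders, not from the question:
CREATE EXTERNAL TABLE spectrum_schema.my_csv_table (
  col1 VARCHAR(100),
  col2 VARCHAR(100),
  col3 VARCHAR(100)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://your-bucket/your-prefix/';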
For fixed-length files, use the RegexSerDe. In this case, the relevant portion of your CREATE TABLE statement will look like this (assuming 3 fields of length 100):
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{100})(.{100})(.{100})")
You can also use a regex to parse data enclosed by multiple characters. For example, in a CSV file where fields were surrounded by triple double quotes ("""):
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = "^\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*,\"*([^\"]*)\"*$"
)
Related
I am trying to combine CSV files that I created and put into S3 into a single CSV file. I am using AWS Athena queries to do that, but I'm getting errors like:
HIVE_BAD_DATA: Error parsing column '0': For input string: apples</li><li>Item 1</li><li>Item 2</li><div class=""atlanta"">organges</div> ...
How do I solve this?
This is my code; the description field/column contains HTML.
Creating table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
`item_no` double,
`description` string,
`name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Querying table:
select * from my_table
where "item_no" < 30000
limit 10
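A likely fix, offered as a hedged sketch: 'serialization.format' and 'field.delim' are LazySimpleSerDe properties, while OpenCSVSerde expects 'separatorChar' and 'quoteChar'. Because the HTML in description contains commas, those fields also need to be quoted in the files so the embedded commas are not treated as delimiters; otherwise the HTML spills into item_no and fails to parse as a double. Assuming the values are wrapped in double quotes:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
  `item_no` double,
  `description` string,
  `name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"'
) LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');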
I'm creating a table in Athena from CSV data in S3. Some of the columns in the data are quoted, so I use:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
'serialization.null.format' = '')
The SerDe works fine, but the null values in the resulting table are replaced with N. How can I keep the null values empty (or as NULL) rather than as N?
Thanks.
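One hedged workaround, assuming the stray N is a remnant of the default \N null marker, which OpenCSVSerde does not let you reconfigure ('serialization.null.format' is a LazySimpleSerDe property, not an OpenCSVSerde one): normalize the value at query time instead. NULLIF is standard SQL in Athena; my_column is a placeholder name:
SELECT NULLIF(my_column, 'N') AS my_column  -- returns NULL when the value is exactly 'N'
FROM my_table;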
I'm trying to create a table in Athena via the AWS CLI. My file has string fields enclosed in quotes. My understanding is that I need to set the SERDEPROPERTIES to take care of this. The command below doesn't work, and I assume it has something to do with the way I handled escaping the quote in the SERDEPROPERTIES.
Doesn't work:
aws athena start-query-execution
--query-string "
create external table tbl1 (val1 INT, val2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '\"', 'escapeChar' = '\\')
LOCATION 's3://bucketpath/folder'
TBLPROPERTIES ('skip.header.line.count'='1');"
--result-configuration "OutputLocation=s3://bucketpath/Output/"
Works:
aws athena start-query-execution
--query-string "
create external table tbl1 (val1 INT, val2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketpath/folder'
TBLPROPERTIES ('skip.header.line.count'='1');"
--result-configuration "OutputLocation=s3://bucketpath/Output/"
Any thoughts?
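One way to sidestep the shell-escaping problem entirely, sketched under the assumption that a local file is acceptable: the AWS CLI can read a parameter value from a file via the file:// prefix, so the DDL quoting never passes through the shell. The file name create_tbl1.sql is illustrative:
aws athena start-query-execution \
  --query-string file://create_tbl1.sql \
  --result-configuration "OutputLocation=s3://bucketpath/Output/"
Here create_tbl1.sql contains the CREATE TABLE statement exactly as it would be typed into the Athena console, with no extra backslash escaping.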
I've got a CSV file (with AWS billing info) where each field is stored as a string in quotes, like this: "value"
So part of a sample line looks as follows:
"234234324223532","First 3 Dashboards per month are free.","2018-08-01 00:00:00","2018-08-01 01:00:00","0.0026881720"
When I define new table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
Id INT,
Desc STRING,
StartTime TIMESTAMP,
EndTime TIMESTAMP,
Cost DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://MYBUCKET/FOLDER/'
With this definition, only the Desc values show up in the results of select * from mydb.mytable.
Is it possible to define some converters in the CREATE TABLE statement?
Or do I need to remove most of the quotation marks (") from the source files? That's very undesirable.
The problem you are having is that Athena is treating all of the quoted content as strings. If you define all columns as STRING, you should be able to see all the content.
Alternatively, you can use a SerDe that lets you define the quote character, so the declared data types can be parsed:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
Id INT,
Desc STRING,
StartTime TIMESTAMP,
EndTime TIMESTAMP,
Cost DOUBLE
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'=',')
LOCATION 's3://MYBUCKET/FOLDER/'
I hope this helps.
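One caveat, flagged as an assumption to verify against the Athena docs: with OpenCSVSerde, Athena expects TIMESTAMP data in UNIX numeric format, so values like "2018-08-01 00:00:00" may still fail to parse. Also, the sample Id value (234234324223532) exceeds the 32-bit INT range, so BIGINT or STRING is safer. A hedged fallback is to declare the awkward columns as STRING and cast at query time:
-- assumes Id, StartTime, and Cost were declared as STRING in the DDL
SELECT CAST(Id AS bigint) AS Id,
       CAST(StartTime AS timestamp) AS StartTime,  -- works for 'yyyy-MM-dd HH:mm:ss' strings
       CAST(Cost AS double) AS Cost
FROM mydb.mytable;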
I have log files where each line has the format:
key1=val1|key2=val2|key3=val3
How do I make Amazon Athena split this into columns key1, key2 and key3?
You can create a table based on Regex. This way you can define the parsing scheme for your table.
For your sample, the DDL would look like this:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
key1 string,
key2 string,
key3 string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^key1=([^\\|]+)\\|key2=([^\\|]+)\\|key3=([^\\|]+)$"
) LOCATION 's3://njams-data/test/';
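A quick usage sketch: each capture group maps to one column in order, and with RegexSerDe, lines that do not match the pattern come back as NULL in every column rather than raising an error:
SELECT key1, key2, key3
FROM test
WHERE key1 IS NOT NULL  -- filters out any lines that failed to match the regex
LIMIT 10;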