I'm using Athena to write some gzip files to S3.
Query
CREATE TABLE NODES_GZIPPED_NODESTEST5
WITH (
external_location = 'my-bucket',
format = 'TEXTFILE'
)
AS SELECT col1, col2
FROM ExistingTableIHave
LIMIT 10;
The table is just 2 columns, but when I create this table and check the external_location, the files are missing the comma delimiter between the data. How can I ensure the CSVs it writes to S3 keep the commas?
You can add a field_delimiter to the WITH expression.
From the AWS docs:
Optional and specific to text-based data storage formats. The single-character field delimiter for files in CSV, TSV, and text files. For example, WITH (field_delimiter = ','). Currently, multicharacter field delimiters are not supported for CTAS queries. If you don't specify a field delimiter, \001 is used by default.
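Applied to the query in the question, a minimal sketch might look like the following. The S3 path is a placeholder (external_location expects a full s3:// URI), and write_compression = 'GZIP' is included on the assumption that gzip output is what you want, as the question states.

```sql
-- Sketch only: the bucket and prefix below are placeholders, not real locations.
CREATE TABLE nodes_gzipped_nodestest5
WITH (
  external_location = 's3://my-bucket/nodes-gzipped/',
  format = 'TEXTFILE',
  field_delimiter = ',',        -- write commas instead of the default \001
  write_compression = 'GZIP'    -- assumption: gzip output is desired
)
AS SELECT col1, col2
FROM ExistingTableIHave
LIMIT 10;
```

Note that the last property in the WITH list must not be followed by a comma.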
Related
I have two Hive queries: one creates a table, and the other reads Parquet data from another table and inserts the relevant columns into the new table. I would like this new Hive table to export its data to an S3 location in csv.gz format. My Hive queries running on EMR currently output files named 00000_0.gz, and I have to rename them to csv.gz with a bash script. This is quite hacky, because I have to mount my S3 directory in my terminal; ideally the queries would produce the right file names directly. Could someone please review my queries to see whether there is any fault? Many thanks.
CREATE TABLE db.test (
app_id string,
app_account_id string,
sdk_ts BIGINT,
device_id string)
PARTITIONED BY (
load_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION "s3://test_unload/";
set hive.execution.engine=tez;
set hive.cli.print.header=true;
set hive.exec.compress.output=true;
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1024000000;
set hive.merge.size.per.task=1024000000;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into db.test
partition(load_date)
select
'' as app_id,
'288' as app_account_id,
from_unixtime(CAST(event_epoch as BIGINT), 'yyyy-MM-dd HH:mm:ss') as sdk_ts,
device_id,
'20221106' as load_date
FROM processed_events.test
where load_date = '20221106';
I am generating a gzipped CSV file using the query below:
CREATE TABLE newgziptable4
WITH (
format = 'TEXTFILE',
write_compression = 'GZIP',
field_delimiter = ',',
external_location = 's3://bucket1/reporting/gzipoutputs4'
) AS
select name , birthdate from "myathena_table";
I am following this link on the AWS website.
The issue is that if I just generate a CSV, I see the column names as the first row of the output. But when I use the method above, I do not see the column names name and birthdate. How can I ensure that I get those in the gz file as well?
I am getting a timeout error for a full-text query in Athena like this:
SELECT count(textbody) FROM "email"."some_table" where textbody like '% some text to search%'
Is there any way to optimize it?
Update:
The create table statement:
CREATE EXTERNAL TABLE `email`.`email5_newsletters_04032019`(
`nesletterid` string,
`name` string,
`format` string,
`subject` string,
`textbody` string,
`htmlbody` string,
`createdate` string,
`active` string,
`archive` string,
`ownerid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'ESCAPED BY' = '\\'
) LOCATION 's3://some_bucket/email_backup_updated/email5/'
TBLPROPERTIES ('has_encrypted_data'='false');
And S3 bucket contents:
# aws s3 ls s3://xxx/email_backup_updated/email5/ --human
2020-08-22 15:34:44 2.2 GiB email_newsletters_04032019_updated.csv.gz
There are 11 million records in this file. The file can be imported into Redshift within 30 minutes, and everything works fine there, but I would prefer to use Athena!
CSV does not integrate well with the Presto engine, because a query has to read each full row to reach a single column. One way to optimize your use of Athena, which will also save you plenty on storage costs, is to switch to a columnar storage format such as Parquet or ORC. You can do the conversion with a query:
-- CTAS runs on the Presto engine, which quotes identifiers with double quotes, not backticks
CREATE TABLE "email"."email5_newsletters_04032019_orc"
WITH (
external_location = 's3://my_orc_table/',
format = 'ORC')
AS SELECT *
FROM "email"."email5_newsletters_04032019";
Then rerun your query above on the new table:
SELECT count(textbody) FROM "email"."email5_newsletters_04032019_orc" where textbody like '% some text to search%'
I'm trying to create an external table in Athena from a quoted CSV file stored on S3. The problem is that my CSV contains missing values in columns that should be read as INTs. A simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
The CREATE TABLE statement runs fine, but as soon as I try to query the table, I get HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quote empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works when you specify all columns as STRING, but that's not what I need.
So the question is: is there any way to read a quoted CSV (quoting is important, as my real data is much more complex) into Athena with the correct column types?
Quick and dirty way to handle these data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that this does not handle the quote characters in the last field. Based on the documentation provided by AWS, this makes sense: ROW FORMAT DELIMITED uses the LazySimpleSerDe, which does not strip quotes or treat a quoted comma as part of the field.
I suspect the solution is to use the following SerDe: org.apache.hadoop.hive.serde2.RegexSerDe.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
Further Reading:
Stack Overflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately, there is no way to get both support for quoted fields and support for null values in Athena. You have to choose one or the other.
You can use OpenCSVSerDe and type all columns as string; that gives you support for quoted fields, and empty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.
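A rough sketch of the OpenCSVSerDe-plus-view approach, using the sample data from the question, could look like this. The names test_null_raw and test_null_typed are hypothetical, the view is assumed to be created in the same database as the table, and the two statements would need to be run separately in Athena.

```sql
-- Sketch: every column is declared as STRING, so the SerDe never has to parse '' as a number.
-- test_null_raw is a hypothetical table name.
CREATE EXTERNAL TABLE test_null_raw (
  id STRING,
  height STRING,
  age STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- Sketch: a view that applies the type conversion once.
-- TRY_CAST returns NULL instead of failing when a value (such as '') cannot be cast.
CREATE OR REPLACE VIEW test_null_typed AS
SELECT
  TRY_CAST(id AS INTEGER) AS id,
  TRY_CAST(height AS INTEGER) AS height,
  TRY_CAST(age AS INTEGER) AS age,
  name
FROM test_null_raw;
```

Regular queries can then select from test_null_typed and see a NULL height for the row with the missing value.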
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/
I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena includes the column headers in S3. How can I skip storing the header column names? I have to make a new table from the results, and the header gets repeated in the data.
Try "skip.header.line.count"="1". This feature has been available in AWS Athena since 2018-01-19. Here is a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Ran an Athena query, with the output going to Amazon S3
Created a new table pointing to this output, based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
However, subsequent queries store even more data in that S3 directory, so it confuses any subsequent executions.
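One possible way around the extra files, offered only as a sketch: Athena exposes a "$path" pseudo-column holding the S3 object each row was read from, so the query can be limited to a single result file. The table name my_results_table and the S3 path below are hypothetical placeholders.

```sql
-- Sketch: keep only the rows that came from one specific result file.
-- my_results_table and the S3 path are placeholders for illustration.
SELECT *
FROM my_results_table
WHERE "$path" = 's3://my-query-results/path/to/one-query-output.csv';
```

Writing each query's output to its own S3 prefix, and pointing the new table at that prefix, avoids the problem without a filter.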