AWS Athena query gzip format, missing column names in first row - amazon-web-services

I am generating a gzipped CSV file using the query below:
CREATE TABLE newgziptable4
WITH (
format = 'TEXTFILE',
write_compression = 'GZIP',
field_delimiter = ',',
external_location = 's3://bucket1/reporting/gzipoutputs4'
) AS
select name, birthdate from "myathena_table";
I am following this link on the AWS website.
The issue is that if I just generate a plain CSV, I see the column names as the first row of the output. But when I use the above method, I do not see the column names name and birthdate. How can I ensure that I get those in the gz output as well?
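CTAS with format = 'TEXTFILE' does not emit a header row, so one workaround people use is to union a literal header row onto the data. A rough sketch, assuming both columns are (or are cast to) strings; the table name and output prefix are changed here to avoid colliding with the original location, and note that Athena does not guarantee row order across output files, so the header line may not land at the top:
CREATE TABLE newgziptable4_with_header
WITH (
format = 'TEXTFILE',
write_compression = 'GZIP',
field_delimiter = ',',
external_location = 's3://bucket1/reporting/gzipoutputs4_header'
) AS
SELECT 'name' AS name, 'birthdate' AS birthdate
UNION ALL
SELECT name, CAST(birthdate AS varchar) FROM "myathena_table";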

Related

Convert result of Athena query to S3 multipart files

I am running an Athena query, trying to get the difference between two tables as below:
CREATE TABLE TableName
WITH (
format = 'TEXTFILE',
write_compression = 'NONE',
external_location = s3Location)
AS
SELECT * FROM currentTable
EXCEPT
SELECT * FROM previousTable;
I would like to store the result of the query in the specified s3Location as S3 multipart files (files with extension .0000_part_00) in TSV format. How can I achieve this?
I am trying to do it programmatically in Java.
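One way to get tab-delimited text output is to add field_delimiter to the CTAS properties. A sketch, with a hypothetical literal S3 path standing in for the s3Location variable; how many part files Athena writes under that prefix depends on the data size:
CREATE TABLE TableName
WITH (
format = 'TEXTFILE',
write_compression = 'NONE',
field_delimiter = '\t',
external_location = 's3://your-bucket/path/to/output/') -- hypothetical literal path in place of s3Location
AS
SELECT * FROM currentTable
EXCEPT
SELECT * FROM previousTable;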

AWS Athena is creating an empty table

In my S3 bucket I have a gz file, without a header, that contains one column.
In the Athena editor, I run the following statement:
CREATE EXTERNAL TABLE IF NOT EXISTS `access_file_o`.`Access_one` (
`ad_id` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://ttt.pix/2022/01/01/00/rrrf.log.1-2022_01_01_00_00_06_316845229-i-06877974d15a00d7e.gz/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');
The file looks like this:
111,
222,
222,
3333,
The table has been created but when I query this table
select * from "Access_one"
there are no rows, only the column names.
Please advise.
The location should be a folder, not a file. This URI works well:
s3://ttt.pix/2022/01/01/00/
While this one returns an empty table:
LOCATION 's3://ttt.pix/2022/01/01/00/rrrf.log.1-2022_01_01_00_00_06_316845229-i-06877974d15a00d7e.gz'
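For reference, the same DDL with the LOCATION trimmed to the folder prefix (the rest of the statement is unchanged from the question):
CREATE EXTERNAL TABLE IF NOT EXISTS `access_file_o`.`Access_one` (
`ad_id` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://ttt.pix/2022/01/01/00/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');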

Athena removing delimiter in output CSV files

I'm using Athena to write some gzip files to S3.
Query
CREATE TABLE NODES_GZIPPED_NODESTEST5
WITH (
external_location = 'my-bucket',
format = 'TEXTFILE'
)
AS SELECT col1, col2
FROM ExistingTableIHave
LIMIT 10;
The table is just 2 columns, but when I create this table and check the external_location, the files are missing the comma delimiter between the values. How can I ensure the CSVs it writes to S3 keep the commas?
You can add a field_delimiter to the WITH expression.
From the AWS docs:
Optional and specific to text-based data storage formats. The single-character field delimiter for files in CSV, TSV, and text files. For example, WITH (field_delimiter = ','). Currently, multicharacter field delimiters are not supported for CTAS queries. If you don't specify a field delimiter, \001 is used by default.
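Applied to the query from the question, that would look something like this (the external_location below is a hypothetical full S3 path, since CTAS requires an s3:// URI):
CREATE TABLE NODES_GZIPPED_NODESTEST5
WITH (
external_location = 's3://my-bucket/some/prefix/',
format = 'TEXTFILE',
field_delimiter = ','
)
AS SELECT col1, col2
FROM ExistingTableIHave
LIMIT 10;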

Full-text query in Amazon Athena is timing out when using `LIKE`

Getting a timeout error for a full-text query in Athena like this:
SELECT count(textbody) FROM "email"."some_table" where textbody like '% some text to seach%'
Is there any way to optimize it?
Update:
The create table statement:
CREATE EXTERNAL TABLE `email`.`email5_newsletters_04032019`(
`nesletterid` string,
`name` string,
`format` string,
`subject` string,
`textbody` string,
`htmlbody` string,
`createdate` string,
`active` string,
`archive` string,
`ownerid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'ESCAPED BY' = '\\'
) LOCATION 's3://some_bucket/email_backup_updated/email5/'
TBLPROPERTIES ('has_encrypted_data'='false');
And S3 bucket contents:
# aws s3 ls s3://xxx/email_backup_updated/email5/ --human
2020-08-22 15:34:44 2.2 GiB email_newsletters_04032019_updated.csv.gz
There are 11 million records in this file. The file can be imported into Redshift within 30 minutes and everything works OK there, but I would prefer to use Athena!
CSV is not a format that integrates very well with the Presto engine, as queries need to read the full row to reach a single column. A way to optimize your use of Athena, which will also save you plenty of storage costs, is to switch to a columnar storage format like Parquet or ORC. You can do the conversion with a query:
CREATE TABLE `email`.`email5_newsletters_04032019_orc`
WITH (
external_location = 's3://my_orc_table/',
format = 'ORC')
AS SELECT *
FROM `email`.`email5_newsletters_04032019`;
Then rerun your query above on the new table:
SELECT count(textbody) FROM "email"."email5_newsletters_04032019_orc" where textbody like '% some text to seach%'

Amazon Athena: How to store query results while skipping column headers?

I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena stores them with the column headers in S3. How can I skip storing the header column names, as I have to make a new table from the results and it is repetitive?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Run an Athena query, with the output going to Amazon S3
Created a new table pointing to this output based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
However, later queries store even more data in that S3 directory, which confuses subsequent executions.
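Putting those steps together, a rough sketch; the results location, table, and column names below are hypothetical stand-ins:
-- Step 2: external table over the CSV that the first query wrote to S3
CREATE EXTERNAL TABLE query_results (
`mydatefield` string,
`value` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"'
)
LOCATION 's3://aws-athena-query-results-example/Unsaved/2020/';
-- Step 3: query it, filtering out lines that start with '#' as in the quoted WHERE clause
SELECT * FROM query_results
WHERE mydatefield NOT LIKE '#%';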