I am generating a gzipped CSV file using the query below:
CREATE TABLE newgziptable4
WITH (
format = 'TEXTFILE',
write_compression = 'GZIP',
field_delimiter = ',',
external_location = 's3://bucket1/reporting/gzipoutputs4'
) AS
select name, birthdate from "myathena_table";
I am following this link on the AWS website.
The issue is that if I just generate a CSV, I see the column names as the first row of the output CSV. But when I use the above method, I do not see the column names name and birthdate. How can I ensure that I get those in the gz file as well?
Related
I am running Athena query and trying to get a difference between two tables as below:
CREATE TABLE TableName
WITH (
format = 'TEXTFILE',
write_compression = 'NONE',
external_location = s3Location)
AS
SELECT * FROM currentTable
EXCEPT
SELECT * FROM previousTable;
I would like to store the result of the query in the specified s3Location as an S3 multipart file (files with extensions like .0000_part_00) and in TSV format. How can I achieve it?
I am trying to do it programmatically in Java.
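A minimal sketch of what the CTAS could look like for TSV output, assuming the same table names; the only real change is '\t' as the field delimiter, and the external_location shown here is a placeholder:
CREATE TABLE TableName
WITH (
format = 'TEXTFILE',
field_delimiter = '\t', -- tab-separated values
write_compression = 'NONE',
external_location = 's3://my-bucket/diff-output/') -- placeholder S3 prefix
AS
SELECT * FROM currentTable
EXCEPT
SELECT * FROM previousTable;
Athena splits CTAS output into multiple files under external_location on its own. From Java, you would typically submit this statement through the Athena API (for example, the SDK's StartQueryExecution call) and poll until it completes.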
In the following S3 bucket, I have a gz file without a header that contains one column.
In the Athena editor, I run the following statement:
CREATE EXTERNAL TABLE IF NOT EXISTS `access_file_o`.`Access_one` (
`ad_id` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://ttt.pix/2022/01/01/00/rrrf.log.1-2022_01_01_00_00_06_316845229-i-06877974d15a00d7e.gz/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');
The file looks like this:
111,
222,
222,
3333,
The table has been created, but when I query this table with
select * from "Access_one"
there are no rows, only column names.
Please advise.
The location should be a folder, not a file.
This URI works well:
s3://ttt.pix/2022/01/01/00/
While this one returns an empty table:
LOCATION 's3://ttt.pix/2022/01/01/00/rrrf.log.1-2022_01_01_00_00_06_316845229-i-06877974d15a00d7e.gz'
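A sketch of the corrected DDL, identical to the one above except that LOCATION now points at the prefix rather than the object:
CREATE EXTERNAL TABLE IF NOT EXISTS `access_file_o`.`Access_one` (
`ad_id` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://ttt.pix/2022/01/01/00/' -- the folder, not the .gz file itself
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');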
I'm using Athena to write some gzip files to S3.
Query
CREATE TABLE NODES_GZIPPED_NODESTEST5
WITH (
external_location = 'my-bucket',
format = 'TEXTFILE'
)
AS SELECT col1, col2
FROM ExistingTableIHave
LIMIT 10;
The table is just 2 columns, but when I create this table and check the external_location, the files are missing the comma delimiter between the values. How can I ensure the CSVs it writes to S3 keep the commas?
You can add a field_delimiter to the WITH clause.
From the AWS docs:
Optional and specific to text-based data storage formats. The single-character field delimiter for files in CSV, TSV, and text files. For example, WITH (field_delimiter = ','). Currently, multicharacter field delimiters are not supported for CTAS queries. If you don't specify a field delimiter, \001 is used by default.
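A sketch of the query from the question with the delimiter added (the external_location here is a placeholder S3 prefix):
CREATE TABLE NODES_GZIPPED_NODESTEST5
WITH (
external_location = 's3://my-bucket/nodes-test/', -- placeholder prefix
format = 'TEXTFILE',
field_delimiter = ',' -- writes comma-separated rows
)
AS SELECT col1, col2
FROM ExistingTableIHave
LIMIT 10;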
I am getting a timeout error for a full-text query in Athena like this:
SELECT count(textbody) FROM "email"."some_table" where textbody like '% some text to search%'
Is there any way to optimize it?
Update:
The create table statement:
CREATE EXTERNAL TABLE `email`.`email5_newsletters_04032019`(
`nesletterid` string,
`name` string,
`format` string,
`subject` string,
`textbody` string,
`htmlbody` string,
`createdate` string,
`active` string,
`archive` string,
`ownerid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'escapeChar' = '\\'
) LOCATION 's3://some_bucket/email_backup_updated/email5/'
TBLPROPERTIES ('has_encrypted_data'='false');
And S3 bucket contents:
# aws s3 ls s3://xxx/email_backup_updated/email5/ --human
2020-08-22 15:34:44 2.2 GiB email_newsletters_04032019_updated.csv.gz
There are 11 million records in this file. The file can be imported into Redshift within 30 minutes, and everything works fine in Redshift, but I would prefer to use Athena!
CSV is not a format that integrates very well with the Presto engine, since queries need to read the full row to reach a single column. A way to optimize your usage of Athena, which will also save you plenty of storage costs, is to switch to a columnar storage format such as Parquet or ORC, and you can actually do that with a query:
CREATE TABLE `email`.`email5_newsletters_04032019_orc`
WITH (
external_location = 's3://my_orc_table/',
format = 'ORC')
AS SELECT *
FROM `email`.`email5_newsletters_04032019`;
Then rerun your query above on the new table:
SELECT count(textbody) FROM "email"."email5_newsletters_04032019_orc" where textbody like '% some text to search%'
I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena includes the column headers in S3. How can I skip storing the header column names, since I have to make a new table from the results and it is repetitive?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Ran an Athena query, with the output going to Amazon S3
Created a new table pointing to this output, based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%' (see the sketch below)
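A sketch of that final query, where query_results and datefield are placeholder names for the new table and its date column:
SELECT *
FROM query_results -- placeholder: the table created over the Athena output location
WHERE datefield NOT LIKE '#%'; -- keeps only rows whose date field does not start with '#'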
However, each subsequent query stores even more data in that S3 directory, which confuses later executions.