Athena query result on gzip-compressed CSV mixes compressed and decompressed data - amazon-athena

I'm setting up AWS Athena over an S3 bucket that contains gzipped CSV files.
I then run a query like this:
SELECT * FROM "sample_db"."sample_table2" limit 100;
The results differ between take 1 and take 2; it seems to mix compressed and decompressed results.
Is there any way to get only the decompressed result in Athena?
The file contents are below:
"title","user_info.client_user_id","user_info.player_id"
"test : csv take 4",,
"title","user_info.client_user_id","user_info.player_id"
"test : csv take 4",,
"title","user_info.client_user_id","user_info.player_id"
"test : csv take 4",,
"title","user_info.client_user_id","user_info.player_id"
"test : csv take 4",,
S3 has only one file: test-sample.gz
Query take 1 (screenshot of results)
Query take 2 (screenshot of results)

The cause was a wrongly formatted query/table definition, the partitioning of the CSV, and corrupted data.
It works when the .gz files are uploaded directly to S3 into their own directories (prefixes).
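In case it helps, here is a minimal sketch of that layout in Python with boto3 (the bucket name, results location, and DDL details are assumptions, not the poster's actual setup, and the dotted CSV headers are renamed because Athena column names cannot contain dots): the gzipped CSVs live in a dedicated prefix and the table LOCATION points at that prefix, so Athena decompresses the .gz objects transparently.
import boto3

athena = boto3.client("athena")

# The table is defined over the directory (prefix) holding the .gz files,
# never over a single file. OpenCSVSerde handles the quoted fields and
# skip.header.line.count drops the header row inside each file.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sample_db.sample_table2 (
  title string,
  client_user_id string,
  player_id string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://your-bucket/sample_table2/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "sample_db"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)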

Related

ATHENA CREATE TABLE AS problem with parquet format

I'm creating a table in Athena and specifying the format as PARQUET; however, the file extension is not being recognized in S3. The type is displayed as "-", which means the file extension is not recognized, even though I can read the files (written from Athena) successfully in a Glue job using:
df = spark.read.parquet()
Here is my statement:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
numeric_field INT
,numeric_field2 INT)
STORED AS PARQUET
LOCATION 's3://xxxxxxxxx/TEST TABLE/'
TBLPROPERTIES ('classification'='PARQUET');
INSERT INTO test
VALUES (10,10),(20,20);
I'm specifying the format as PARQUET, but when I check the S3 bucket the file type is displayed as "-". Also, when I check the Glue catalog, the table type is set as 'unknown'.
S3 storage screenshot
I expected the type to be recognized as "parquet" in the S3 bucket.
After contacting AWS support, it was confirmed that with CTAS queries Athena does not create file extensions for Parquet files.
"Further to confirm this, I do see the Knowledge Center article [1] where CTAS generates the Parquet files without extension ( Under section 'Convert the data format and set the approximate file size' Point 5)."
However, the files written from Athena are readable even without the extension.
Reference:
[1] https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Workaround: I created a function to change the file extension, basically iterating over the files in the S3 bucket and then writing the contents back to the same location with the .parquet file extension.
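A minimal sketch of that kind of workaround with boto3 (the bucket and prefix are placeholders taken from the example above, not the poster's actual code): it copies every extension-less output object to a key ending in .parquet and deletes the original.
import boto3

s3 = boto3.client("s3")
bucket = "xxxxxxxxx"          # placeholder bucket
prefix = "TEST TABLE/"        # the CTAS/INSERT output location

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/") or key.endswith(".parquet"):
            continue  # skip "folders" and already-renamed objects
        # Copy to a new key with the .parquet extension, then remove the original.
        s3.copy_object(
            Bucket=bucket,
            Key=key + ".parquet",
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3.delete_object(Bucket=bucket, Key=key)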

Is it possible to unload data in AWS Athena to a single file?

The doc states that
UNLOAD results are written to multiple files in parallel.
I guess this is more efficient for both read and write, so unloading to a single file doesn't make sense. But, if for some reason the end user wants the output as a single file, is it possible?
Running a SELECT query in Athena produces a single result file in Amazon S3 in uncompressed CSV format; this is the default behaviour.
If your query is expected to output a large result set, then significant time is spent writing the results as one single file to Amazon S3. With UNLOAD you can split the results into multiple files in Amazon S3, which reduces the time spent in the writing phase and hence gives better performance, and you can also use other formats and compression, such as Parquet.
What you are trying to do is not what UNLOAD is meant for. One solution would be to write some kind of post-processor that merges the files after the write is finished, maybe using a Lambda function triggered on the S3 write, as in the sketch below.
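A rough sketch of such a post-processor in Python with boto3 (the bucket, prefix, and output key are placeholders; it assumes the parts are uncompressed text that can simply be concatenated, and it buffers everything in memory, so it only suits modest result sets):
import boto3

s3 = boto3.client("s3")
bucket = "your_bucket"
prefix = "your_path/"

# Collect every part object under the UNLOAD prefix and concatenate them.
merged = bytearray()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        merged.extend(body)

# Write the merged content back as a single object.
s3.put_object(Bucket=bucket, Key="merged/your_file.csv", Body=bytes(merged))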
Assuming your UNLOAD query uses the TEXTFILE format and gzip compression, like:
UNLOAD( select * from my_table )
TO 's3://your_bucket/your_path/'
WITH (
format = 'TEXTFILE',
compression = 'gzip',
field_delimiter = '\t'
)
A simple solution would be the following:
# download all the result parts
aws s3 cp --recursive s3://your_bucket/your_path/ .
# decompress them in place
gzip -d *
# concatenate the decompressed parts into a single file
cat * > your_file.csv

Amazon AWS Athena HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split / Not valid Parquet file, parquet files compress to gzip with Athena

I'm trying to build skills on Amazon Athena.
I have already succeeded in querying data in JSON and Apache Parquet format with Athena.
What I'm trying to do now is add compression (gzip) to it.
My JSON Data :
{
"id": 1,
"prenom": "Firstname",
"nom": "Lastname",
"age": 23
}
Then, I transform the JSON into Apache Parquet format with an npm module : https://www.npmjs.com/package/parquetjs
And finally, I compress the resulting Parquet file in GZIP format and put it in my S3 bucket: test-athena-personnes.
My Athena Table :
CREATE EXTERNAL TABLE IF NOT EXISTS personnes (
id INT,
nom STRING,
prenom STRING,
age INT
)
STORED AS PARQUET
LOCATION 's3://test-athena-personnes/'
tblproperties ("parquet.compress"="GZIP");
Then, to test it, I run a very simple query: Select * from personnes;
I get this error message:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://test-athena-personnes/personne1.parquet.gz (offset=0, length=257): Not valid Parquet file: s3://test-athena-personnes/personne1.parquet.gz expected magic number: [80, 65, 82, 49] got: [-75, 1, 0, 0]
Is there anything I didn't understand, or something I'm doing wrong? I can query Apache Parquet files without gzip compression, but not with it.
Thank you in advance
A Parquet file consists of two parts [1]:
Data
Metadata
When you try to read this file through Athena, it will attempt to read the metadata first and then the actual data. In your case you are compressing the whole Parquet file with gzip, and when Athena tries to read the file it fails, because the metadata is hidden behind the outer compression.
So the right way to compress a Parquet file is while writing/creating the Parquet file itself: you need to specify the compression codec while generating the file with parquetjs.
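For illustration, here is the same idea sketched in Python with pyarrow rather than parquetjs (file name and values are placeholders): the codec is applied to the data pages inside the file at write time, so the output remains a valid .parquet file that Athena can open.
import pyarrow as pa
import pyarrow.parquet as pq

# Build a tiny table matching the sample JSON record.
table = pa.table({"id": [1], "prenom": ["Firstname"], "nom": ["Lastname"], "age": [23]})

# compression="gzip" compresses the column data inside the Parquet file;
# the file itself is NOT gzipped, so its magic number stays intact.
pq.write_table(table, "personne1.parquet", compression="gzip")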

Athena returns empty results from Firehose > Glue > S3 parquet setup

I have set up a Kinesis Firehose that passes data through Glue, which compresses and transforms the JSON to Parquet and stores it in an S3 bucket. The transformation is successful and I can query the output file normally with Apache Drill. I cannot, however, get Athena to work. Doing a preview table (select * from s3data limit 10), I get results with the proper headers for the columns, but the data is empty.
Steps I have taken:
I already added the newline to my source: JSON.stringify(event) + '\n';
Downloaded the Parquet file and queried it successfully with Apache Drill
Glue puts the Parquet file in YY/MM/DD/HH folders. I have tried moving the Parquet file to the root folder and I get the same empty results.
The end goal is to get the data eventually into QuickSight, so if I'm going about this wrong, let me know.
What am I missing?

Decompress a zip file in AWS Glue

I have a compressed gzip file in an S3 bucket. The files will be uploaded to the S3 bucket daily by the client. The gzip, when uncompressed, will contain 10 files in CSV format, all with the same schema. I need to uncompress the gzip file and, using a Glue data crawler, create a schema before running an ETL script on a dev endpoint.
Is Glue capable of decompressing the zip file and creating a data catalog? Or is there any Glue library available that we can use directly in the Python ETL script? Or should I opt for a Lambda or some other utility, so that as soon as the zip file is uploaded, a utility decompresses it and provides it as input to Glue?
Appreciate any replies.
Glue can do the decompression, but it wouldn't be optimal, as the gzip format is not splittable (that means only one executor will work on it). More info about that here.
You can instead decompress with a Lambda and invoke a Glue crawler for the new folder, along the lines of the sketch below.
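A rough sketch of that idea (a Lambda handler using boto3 and the gzip module; the bucket comes from the S3 event, while the output prefix and crawler name are placeholders, and it assumes a plain .gz object rather than a multi-file archive):
import gzip
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

def handler(event, context):
    # Triggered by an S3 ObjectCreated event for the uploaded .gz file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if not key.endswith(".gz"):
            continue
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        data = gzip.decompress(raw)
        # Write the decompressed content to a separate prefix for the crawler.
        out_key = "uncompressed/" + key[: -len(".gz")]
        s3.put_object(Bucket=bucket, Key=out_key, Body=data)
    # Catalogue the new folder.
    glue.start_crawler(Name="my-crawler")  # placeholder crawler name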
Use glueContext.create_dynamic_frame.from_options and mention the compression type in the connection options. Similarly, the output can also be compressed while writing to S3. The snippet below worked for bzip; change the compression value to gzip and try.
I tried the Target Location in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, and I modified the generated code to read a compressed file from S3. This is not directly covered in the docs.
I'm not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a DataFrame and back to a DynamicFrame for a 400 MB compressed CSV file in bzip format. Please note that execution time is different from the start_time and end_time shown in the console.
# Read a bzip-compressed CSV from S3 into a DynamicFrame;
# swap 'compression' to 'gzip' for .gz input.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    'csv',
    {
        'separator': ';'
    }
)
I've written a Glue Job that can unzip s3 files and put them back in s3.
Take a look at https://stackoverflow.com/a/74657489/17369563
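Not the code from that answer, but a rough sketch of what such a job can look like as a Glue Python shell script using boto3 and zipfile (the bucket, input key, and output prefix are placeholders):
import io
import zipfile
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                 # placeholder
zip_key = "incoming/archive.zip"     # placeholder

# Download the archive, then write each member back to S3 uncompressed.
zipped = s3.get_object(Bucket=bucket, Key=zip_key)["Body"].read()
with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
    for name in archive.namelist():
        s3.put_object(
            Bucket=bucket,
            Key="unzipped/" + name,
            Body=archive.read(name),
        )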