Does Amazon S3 Manifest File Support Parquet Format?

According to this AWS documentation, the Amazon S3 manifest file does not support the Parquet format. I find this hard to believe, because Parquet is a very common file format and, from what I understand, it is the format you are supposed to use for Athena and Redshift. Here's another piece of documentation that references the S3 manifest file in relation to Redshift and the Parquet file format, but I'm not sure what it means exactly: https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
I'm just trying to create a data set in Amazon QuickSight using some parquet files in one of my S3 buckets. I tried omitting the globalUploadSettings field in my manifest and was able to pull the data in but QuickSight doesn't know what type of file it is so it just displays the information with a bunch of � characters.
Manifest I currently have:
{
    "fileLocations": [
        {
            "URIPrefixes": [
                "https://s3.amazonaws.com/myBucket/myFolderWithData/"
            ]
        }
    ]
}

The S3 manifest file that QuickSight uses does not support the Parquet format, but you can use Athena as the data source for the data set, and Athena can read Parquet.
Importing File Data
You can use files in Amazon S3 or on your local (on-premises) network as data sources. QuickSight supports files in the following formats:
CSV and TSV – Comma-delimited and tab-delimited text files
ELF and CLF – Extended and common log format files
JSON – Flat or semistructured data files
XLSX – Microsoft Excel files
QuickSight supports UTF-8 file encoding, but not UTF-8 (with BOM).
Files in Amazon S3 that have been compressed with zip or gzip (www.gzip.org) can be imported as-is. If you used another compression program for files in Amazon S3, or if the files are on your local network, remove compression before importing them.
https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html
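To make the format restriction concrete, here is a minimal Python sketch that builds a QuickSight S3 manifest and refuses formats the manifest cannot declare. The function name build_quicksight_manifest and the SUPPORTED_FORMATS set are hypothetical names for illustration; the set reflects the text-based formats in the list quoted above (as far as I know, XLSX is uploaded directly rather than declared in a manifest), so check it against the current QuickSight documentation before relying on it.

```python
import json

# Assumption: these are the values the manifest "format" field accepts,
# taken from the text-based formats in the list quoted above.
SUPPORTED_FORMATS = {"CSV", "TSV", "CLF", "ELF", "JSON"}


def build_quicksight_manifest(uri_prefixes, file_format="CSV", contains_header=True):
    """Return a QuickSight S3 manifest as a dict, rejecting unsupported formats."""
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"QuickSight S3 manifests do not support {file_format!r}")
    return {
        "fileLocations": [{"URIPrefixes": list(uri_prefixes)}],
        "globalUploadSettings": {
            "format": fmt,
            # The manifest stores this flag as a string, not a JSON boolean.
            "containsHeader": "true" if contains_header else "false",
        },
    }


if __name__ == "__main__":
    manifest = build_quicksight_manifest(
        ["https://s3.amazonaws.com/myBucket/myFolderWithData/"], "CSV"
    )
    print(json.dumps(manifest, indent=2))
```

Calling it with "PARQUET" raises ValueError, which mirrors the situation in the question: there is no manifest setting that makes QuickSight parse Parquet from S3 directly.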

For a manifest file that lists Parquet files (this is the Redshift COPY manifest, not the QuickSight one), you need to specify the content length of each file as well.
Link : https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
S3 manifest file example for parquet format:
{
    "entries": [
        {"url":"s3://mybucket/unload/manifest_0000_part_00", "meta": { "content_length": 5956875 }},
        {"url":"s3://mybucket/unload/unload/manifest_0001_part_00", "meta": { "content_length": 5997091 }}
    ]
}
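If you generate such a manifest from a script, a small helper along these lines may be enough. This is only a sketch: build_copy_manifest is a hypothetical name, and it assumes you already know each object's size in bytes (for example from an S3 listing).

```python
import json


def build_copy_manifest(files):
    """Build a manifest dict from (s3_url, size_in_bytes) pairs."""
    entries = []
    for url, size in files:
        if size < 0:
            raise ValueError(f"negative size for {url}")
        # content_length is the object's size in bytes.
        entries.append({"url": url, "meta": {"content_length": size}})
    return {"entries": entries}


if __name__ == "__main__":
    manifest = build_copy_manifest(
        [
            ("s3://mybucket/unload/manifest_0000_part_00", 5956875),
            ("s3://mybucket/unload/unload/manifest_0001_part_00", 5997091),
        ]
    )
    print(json.dumps(manifest, indent=2))
```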

Related

ATHENA CREATE TABLE AS problem with parquet format

I'm creating a table in Athena and specifying the format as PARQUET, but the file extension is not being recognized in S3. The type is displayed as "-", which means that the file extension is not recognized, even though I can read the files (written from Athena) successfully in a Glue job using:
df = spark.read.parquet()
Here is my statement:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
    numeric_field INT,
    numeric_field2 INT
)
STORED AS PARQUET
LOCATION 's3://xxxxxxxxx/TEST TABLE/'
TBLPROPERTIES ('classification'='PARQUET');

INSERT INTO test
VALUES (10,10),(20,20);
I'm specifying the format as PARQUET, but when I check in the S3 bucket the file type is displayed as "-". Also, when I check the Glue catalog, the table type is set to 'unknown'.
I expected that the type is recognized as "parquet" in the S3 bucket
After I contacted AWS Support, they confirmed that Athena does not add file extensions to the Parquet files it writes with CTAS queries.
"Further to confirm this, I do see the Knowledge Center article [1] where CTAS generates the Parquet files without extension ( Under section 'Convert the data format and set the approximate file size' Point 5)."
However, the files written by Athena are readable even without the extension.
Reference:
[1] https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Workaround: I created a function to change the file extension. It iterates over the files in the S3 bucket and writes the contents back to the same location with a .parquet file extension.
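The function itself was not posted, so the following is only a sketch of how such a rename could look. It takes a boto3-style S3 client as a parameter and uses a server-side copy followed by a delete instead of downloading and re-writing the contents, which is my substitution rather than the original approach. The name add_parquet_suffix is hypothetical; get_paginator, copy_object and delete_object are the boto3 client calls I am assuming here.

```python
def add_parquet_suffix(s3_client, bucket, prefix, suffix=".parquet"):
    """Copy each object under prefix to key + suffix, then delete the original.

    s3_client is expected to behave like boto3.client("s3").
    Returns a list of (old_key, new_key) pairs.
    """
    renamed = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Skip "folder" placeholders and objects that already have the suffix.
            if key.endswith("/") or key.endswith(suffix):
                continue
            new_key = key + suffix
            s3_client.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": key},
            )
            s3_client.delete_object(Bucket=bucket, Key=key)
            renamed.append((key, new_key))
    return renamed
```

With real credentials it would be called as add_parquet_suffix(boto3.client("s3"), "my-bucket", "TEST TABLE/"), where the bucket name is a placeholder. Keep in mind that a single copy_object call has an object size limit, so very large files would need a multipart copy.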

Amazon AWS Athena HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split / Not valid Parquet file, parquet files compress to gzip with Athena

I'm trying to build skills on Amazon Athena.
I have already succeeded in querying data in JSON and Apache Parquet format with Athena.
What I'm trying to do now is add compression (gzip) to it.
My JSON Data :
{
"id": 1,
"prenom": "Firstname",
"nom": "Lastname",
"age": 23
}
Then, I transform the JSON into Apache Parquet format with an npm module : https://www.npmjs.com/package/parquetjs
And finally, I compress the parquet file I get in GZIP format and put it in my s3 bucket : test-athena-personnes.
My Athena Table :
CREATE EXTERNAL TABLE IF NOT EXISTS personnes (
id INT,
nom STRING,
prenom STRING,
age INT
)
STORED AS PARQUET
LOCATION 's3://test-athena-personnes/'
tblproperties ("parquet.compress"="GZIP");
Then, to test it, I launch a very simple request: Select * from personnes;
I get the error message :
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://test-athena-personnes/personne1.parquet.gz (offset=0, length=257): Not valid Parquet file: s3://test-athena-personnes/personne1.parquet.gz expected magic number: [80, 65, 82, 49] got: [-75, 1, 0, 0]
Is there anything I didn't understand or that I'm doing wrong? I can query Apache Parquet files without gzip compression, but not with it.
Thank you in advance
A Parquet file consists of two parts[1]:
Data
Metadata
When Athena reads this file, it attempts to read the metadata first and then the actual data. In your case, you compressed the whole Parquet file with gzip, so when Athena tries to read it, it fails because the metadata is hidden by the compression.
So the right way to compress a Parquet file is while writing/creating the Parquet file itself. You need to specify the compression codec when generating the file with parquetjs.
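The "expected magic number: [80, 65, 82, 49]" in the error is the ASCII string PAR1, which a Parquet file carries at both its start and its end. A quick local check, sketched below in Python, shows why a gzipped file fails that test. The function name describe_file is hypothetical, and the sample bytes are a stand-in with the right markers, not a valid Parquet file.

```python
import gzip

PARQUET_MAGIC = b"PAR1"      # bytes 80, 65, 82, 49 from the error message
GZIP_MAGIC = b"\x1f\x8b"     # first two bytes of any gzip stream


def describe_file(data: bytes) -> str:
    """Classify raw bytes as 'gzip', 'parquet' or 'unknown' by magic numbers."""
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    if len(data) >= 8 and data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC:
        return "parquet"
    return "unknown"


if __name__ == "__main__":
    # Stand-in payload: correct markers, but NOT a real Parquet file.
    fake_parquet = PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC
    print(describe_file(fake_parquet))
    print(describe_file(gzip.compress(fake_parquet)))
```

A gzip stream ends with the uncompressed size as a little-endian integer rather than with PAR1. That would be consistent with the [-75, 1, 0, 0] reported in the error (437 when read as a little-endian integer), though that reading is my inference and not something the error message states.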

Is it possible to mix filetypes in an S3 bucket for AWS Athena?

I've got a set of files I want to put into AWS S3 so I can read them using AWS Athena. The directory structure looks like:
brand=mrwhippy/model=flake/serialnumber=0001/time=2019-09-11T02:57:33+0000Z/
But within that directory I've got several files of different types - INI and JSON.
Can I set up AWS Athena to handle this, or do I need to convert the INI files to JSON?
If they're all JSON can I use the filenames to differentiate the values or do I need to put this at the base of the JSON tree? For instance:
Config.json:
{
"Config":{
"Setting1": 1,
"Setting2": "Cheese"
}
}
A table can be defined with a LOCATION that points to a directory.
Athena will process all files in that directory (including sub-directories) as belonging to that table.
Therefore, all files under that path need to be in the same format.
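Before pointing a table at a path, it can help to check whether the path really holds a single format. The sketch below counts file extensions in a list of object keys; extension_counts is a hypothetical helper, and it assumes you have already listed the keys under the table's LOCATION (the file names in the example are made up).

```python
from collections import Counter
import posixpath


def extension_counts(keys):
    """Count file extensions among S3 object keys, ignoring folder placeholders."""
    counts = Counter()
    for key in keys:
        if key.endswith("/"):
            continue
        # splitext only looks at the last path component, so dots or colons
        # in partition folders such as time=2019-09-11T02:57:33+0000Z are ignored.
        ext = posixpath.splitext(key)[1].lower() or "(none)"
        counts[ext] += 1
    return dict(counts)


if __name__ == "__main__":
    base = "brand=mrwhippy/model=flake/serialnumber=0001/time=2019-09-11T02:57:33+0000Z/"
    print(extension_counts([base + "Config.json", base + "Settings.ini"]))
```

More than one extension in the result suggests the files should be split into separate locations (one table each) or converted to a common format first.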

Decompress a zip file in AWS Glue

I have a compressed gzip file in an S3 bucket. The files will be uploaded to the S3 bucket daily by the client. When uncompressed, the gzip will contain 10 files in CSV format, all with the same schema. I need to uncompress the gzip file and use a Glue crawler to create a schema before running an ETL script using a dev endpoint.
Is Glue capable of decompressing the zip file and creating a data catalog? Or is there a Glue library that we can use directly in the Python ETL script? Or should I opt for a Lambda or another utility, so that as soon as the zip file is uploaded, I run a utility to decompress it and provide the result as input to Glue?
Appreciate any replies.
Glue can do the decompression, but it wouldn't be optimal, because the gzip format is not splittable (that means only one executor will work on the file).
You can try decompressing with a Lambda function and then invoking a Glue crawler on the new folder.
Use glueContext.create_dynamic_frame.from_options and specify the compression type in the connection options. Similarly, the output can also be compressed while writing to S3. The snippet below worked for bzip; please change the format to gz|gzip and try.
I tried the Target Location setting in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, then changed the generated code to read a compressed file from S3. This is not directly documented.
I'm not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a DataFrame and back to a DynamicFrame for a 400 MB compressed CSV file in bzip format. Please note that execution time is different from the start_time and end_time shown in the console.
# Assumes glueContext is the GlueContext created earlier in the job script.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    'csv',
    {
        'separator': ';'
    }
)
I've written a Glue Job that can unzip s3 files and put them back in s3.
Take a look at https://stackoverflow.com/a/74657489/17369563

Amazon Athena and compressed S3 files

I have an S3 bucket with several zipped CSV files (utilization logs.) I'd like to query this data with Athena, but the output is completely garbled.
It appears Athena is trying to parse the zip files without decompressing them first. Is it possible to force Hive to recognize my files as compressed data?
Athena does support compression, but the supported formats are:
Snappy (.snappy)
BZIP2 (.bz2)
GZIP (.gz)
Those formats are detected by their filename suffix. If the suffix doesn't match, the reader does not decode the content.
I tested it with a test.csv.gz file and it worked right away. So try changing the compression from zip to gzip and it should work.
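If the files already exist as zip archives, the conversion can be scripted. The following Python sketch uses only the standard library and works on bytes in memory; zip_to_gzip_members is a hypothetical name, and downloading from and uploading to S3 is left out.

```python
import gzip
import io
import zipfile


def zip_to_gzip_members(zip_bytes):
    """Return {member_name + '.gz': gzip_bytes} for each file in a zip archive."""
    converted = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Each member becomes its own gzip stream, so the .gz suffix
            # matches the content and the reader can decode it.
            converted[info.filename + ".gz"] = gzip.compress(archive.read(info))
    return converted


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("utilization.csv", "host,cpu\nweb1,42\n")
    for name, payload in zip_to_gzip_members(buffer.getvalue()).items():
        print(name, len(payload), "bytes")
```

Each archive member is read fully into memory here, which is fine for small log files but would need streaming for large ones.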