Amazon Athena and compressed S3 files - amazon-web-services

I have an S3 bucket with several zipped CSV files (utilization logs). I'd like to query this data with Athena, but the output is completely garbled.
It appears Athena is trying to parse the zip files without decompressing them first. Is it possible to force Hive to recognize my files as compressed data?

Compression is supported in Athena, but only in the following formats:
Snappy (.snappy)
bzip2 (.bz2)
GZIP (.gz)
Those formats are detected by their filename suffix. If the suffix doesn't match, the reader does not decode the content.
I tested it with a test.csv.gz file and it worked right away. So try changing the compression from zip to gzip and it should work.
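If you have a lot of existing .zip archives to convert, a minimal boto3 sketch along these lines can re-compress each member as .gz in place. The bucket name and prefix below are placeholders, not anything from the question:

import gzip
import io
import zipfile

import boto3

s3 = boto3.client("s3")
BUCKET = "my-log-bucket"      # placeholder bucket name
PREFIX = "utilization-logs/"  # placeholder prefix containing the .zip files

# List the .zip objects, re-compress each member as .gz, and upload alongside.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".zip"):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        with zipfile.ZipFile(io.BytesIO(body)) as zf:
            for member in zf.namelist():
                gz_key = key[:-len(".zip")] + "-" + member + ".gz"
                s3.put_object(
                    Bucket=BUCKET,
                    Key=gz_key,
                    Body=gzip.compress(zf.read(member)),
                )

The .gz suffix on the new keys is what lets Athena detect the compression, per the answer above.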

Related

process non csv, json and parquet files from s3 using glue

A little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but from what I saw when I tried to start a new job from a blank graph, the only source options I got for S3 were the CSV, JSON and Parquet file formats, and my files are not of these types. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but that command is something I would need to download to a machine. Is there any way I can do the conversion and then use Glue on that JSON?
Thanks.
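One way to do the conversion the question describes (turning the custom files into JSON before pointing Glue at them) is a small Glue Python shell job or Lambda. The sketch below is only an illustration: parse_record is a hypothetical parser for whatever the custom format is, and the bucket names are placeholders.

import json

import boto3

s3 = boto3.client("s3")
SRC_BUCKET = "raw-bucket"        # placeholder: bucket with the custom-format files
DST_BUCKET = "converted-bucket"  # placeholder: bucket Glue/Athena will read from


def parse_record(line: str) -> dict:
    """Hypothetical parser for the custom format; replace with the real logic."""
    fields = line.rstrip("\n").split("|")
    return {"col_a": fields[0], "col_b": fields[1]}


def convert_object(key: str) -> None:
    raw = s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"].read().decode("utf-8")
    # Write one JSON object per line (JSON Lines), which Glue crawlers handle well.
    out = "\n".join(json.dumps(parse_record(line)) for line in raw.splitlines() if line)
    s3.put_object(Bucket=DST_BUCKET, Key=key + ".json", Body=out.encode("utf-8"))

Once the converted JSON files are in the destination bucket, a Glue crawler and job can treat them like any other JSON source.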

Does Amazon S3 Manifest File Support Parquet Format?

According to this AWS documentation, it appears that Amazon S3 does not support the Parquet format in the manifest file. I find this hard to believe, because Parquet is a very common format and, from what I understand, it's the format you are supposed to use with Athena/Redshift. Here is another piece of documentation that references the S3 manifest file in relation to Redshift and Parquet, but I'm not too sure what it means exactly: https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
I'm just trying to create a dataset in Amazon QuickSight using some Parquet files in one of my S3 buckets. I tried omitting the globalUploadSettings field in my manifest and was able to pull the data in, but QuickSight doesn't know what type of file it is, so it just displays the information with a bunch of � characters.
Manifest I currently have:
{
    "fileLocations": [
        {
            "URIPrefixes": [
                "https://s3.amazonaws.com/myBucket/myFolderWithData/"
            ]
        }
    ]
}
Amazon S3 manifest files do not support the Parquet format, but you can use Athena as the QuickSight dataset source to work with Parquet.
Importing File Data
You can use files in Amazon S3 or on your local (on-premises) network as data sources. QuickSight supports files in the following formats:
CSV and TSV – Comma-delimited and tab-delimited text files
ELF and CLF – Extended and common log format files
JSON – Flat or semistructured data files
XLSX – Microsoft Excel files
QuickSight supports UTF-8 file encoding, but not UTF-8 (with BOM).
Files in Amazon S3 that have been compressed with zip or gzip (www.gzip.org) can be imported as-is. If you used another compression program for files in Amazon S3, or if the files are on your local network, remove the compression before importing them.
https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html
For an S3 manifest file referencing Parquet files, you need to specify the content length as well.
Link: https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
S3 manifest file example for parquet format:
{
    "entries": [
        {"url": "s3://mybucket/unload/manifest_0000_part_00", "meta": { "content_length": 5956875 }},
        {"url": "s3://mybucket/unload/manifest_0001_part_00", "meta": { "content_length": 5997091 }}
    ]
}
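If you need to generate such a manifest across many Parquet parts, a small boto3 sketch can build it, since content_length is just the object size in bytes that list_objects_v2 already returns. Bucket, prefix and output key below are placeholders:

import json

import boto3

s3 = boto3.client("s3")
BUCKET = "mybucket"   # placeholder
PREFIX = "unload/"    # placeholder prefix containing the Parquet parts

# Build a Redshift-style manifest with the required content_length metadata.
entries = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        entries.append({
            "url": f"s3://{BUCKET}/{obj['Key']}",
            "meta": {"content_length": obj["Size"]},
        })

manifest = {"entries": entries}
s3.put_object(
    Bucket=BUCKET,
    Key="manifests/parquet.manifest",  # placeholder output key
    Body=json.dumps(manifest, indent=2).encode("utf-8"),
)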

Decompress a zip file in AWS Glue

I have a compressed gzip file in an S3 bucket. The files will be uploaded to the bucket daily by the client. When uncompressed, each archive contains 10 files in CSV format, all with the same schema. I need to uncompress the file and, using a Glue data crawler, create a schema before running an ETL script on a dev endpoint.
Is Glue capable of decompressing the archive and creating a data catalog? Is there any Glue library available that we can use directly in the Python ETL script? Or should I opt for a Lambda or some other utility, so that as soon as the file is uploaded, it gets decompressed and provided as input to Glue?
Appreciate any replies.
Glue can do the decompression, but it won't be optimal, because the gzip format is not splittable (meaning only one executor will work on it). More info about that here.
You can try decompressing with a Lambda function and then invoke a Glue crawler on the new folder.
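A rough sketch of that Lambda, triggered by the S3 upload. The output prefix and crawler name are placeholders, and it assumes each upload is a single gzipped CSV; a multi-file archive (tar.gz or zip) would need tarfile/zipfile instead of gzip:

import gzip
import urllib.parse

import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

DEST_PREFIX = "decompressed/"        # placeholder output prefix
CRAWLER_NAME = "daily-csv-crawler"   # placeholder Glue crawler name


def handler(event, context):
    """S3-triggered Lambda: gunzip the new object, then start the crawler."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".gz"):
            continue
        data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        out_key = DEST_PREFIX + key.rsplit("/", 1)[-1][:-len(".gz")]
        s3.put_object(Bucket=bucket, Key=out_key, Body=gzip.decompress(data))
    glue.start_crawler(Name=CRAWLER_NAME)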
Use glueContext.create_dynamic_frame.from_options and specify the compression type in the connection options. Similarly, the output can be compressed when writing to S3. The snippet below worked for bzip; change the format to gz|gzip and try.
I tried the target location in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, then modified the generated code to read a compressed file from S3. This is not directly documented.
Not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a DataFrame and back to a DynamicFrame for a 400 MB compressed CSV file in bzip format. Note that execution time is different from the start_time and end_time shown in the console.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    'csv',
    {
        'separator': ';'
    }
)
I've written a Glue job that can unzip S3 files and put them back in S3.
Take a look at https://stackoverflow.com/a/74657489/17369563

Using tar.gz file as a source for Amazon Athena

If I define *.tsv files on Amazon S3 as the source for an Athena table and use OpenCSVSerde or LazySimpleSerDe as the deserializer, it works correctly. But if I define *.tar.gz files that contain *.tsv files, I see several strange rows in the table (e.g. a row that contains the tsv file name, and several empty rows). What is the right way to use tar.gz files in Athena?
The problem is tar: it adds additional rows (the archive's file headers and padding). Athena can open only *.gz files, not tar archives, so in this case I have to use *.gz instead of *.tar.gz.
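If the upstream data still arrives as *.tar.gz, one workaround is to repackage each member as a plain .gz before Athena reads it. A rough sketch, with placeholder bucket, key and prefix names:

import gzip
import io
import tarfile

import boto3

s3 = boto3.client("s3")
BUCKET = "my-athena-bucket"   # placeholder
SRC_KEY = "raw/data.tar.gz"   # placeholder tar.gz object
DST_PREFIX = "tsv/"           # placeholder prefix the Athena table points at

# Extract each .tsv member from the tar.gz and re-upload it as a plain .gz,
# which Athena can read directly.
archive = s3.get_object(Bucket=BUCKET, Key=SRC_KEY)["Body"].read()
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
    for member in tar.getmembers():
        if not member.isfile() or not member.name.endswith(".tsv"):
            continue
        payload = tar.extractfile(member).read()
        s3.put_object(
            Bucket=BUCKET,
            Key=DST_PREFIX + member.name.split("/")[-1] + ".gz",
            Body=gzip.compress(payload),
        )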

Force AWS EMR to unzip files in S3

I have a bucket in AWS's S3 service that contains gzipped CSV files; however, when they were stored, they were all saved with the metadata Content-Type of text/csv.
Now I am using AWS EMR, which will not recognize them as gzipped files and decompress them. I've looked through the configuration options for EMR but don't see anything that would work... I have almost a million files, so updating their metadata would require a Boto script that cycles through all the files and rewrites the metadata value.
Am I missing something easy? Thanks!
The Content-Type isn't the problem... that's correct if the files are CSV, but since you stored them gzipped, you also needed to set Content-Encoding: gzip in the header metadata. Doing that "should" trigger the user agent that's fetching them to gunzip them on the fly when they are downloaded... so had you done that, it should have "just worked."
(I store gzipped log files this way, with Content-Type: text/plain and Content-Encoding: gzip and when you download them with a web browser, the file you get is no longer gzipped because the browser untwizzles the compression on the fly due to the Content-Encoding header.)
But, since you've already uploaded the files, I did find this in the google machine, which might help:
GZipped input. A lot of my input data had already been gzipped, but luckily if you pass -jobconf stream.recordreader.compression=gzip in the extra arguments section Hadoop will decompress them on the fly before passing the data to your mapper.
http://petewarden.typepad.com/searchbrowser/2010/01/elastic-mapreduce-tips.html
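If you'd rather go back and fix the object metadata the way the question describes, that Boto script is short. A sketch using boto3's copy_object, copying each object onto itself with MetadataDirective='REPLACE' to add the Content-Encoding header (bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")
BUCKET = "my-emr-input-bucket"   # placeholder bucket with the gzipped CSVs

# Copy each object onto itself with REPLACE to rewrite its headers,
# keeping Content-Type: text/csv and adding Content-Encoding: gzip.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": key},
            ContentType="text/csv",
            ContentEncoding="gzip",
            MetadataDirective="REPLACE",
        )

Whether EMR honors the header or only the file suffix depends on the input format, so the -jobconf option quoted above may still be the simpler route.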