Converting JSON files to Apache Parquet format using an AWS Glue job

I'm trying to convert the JSON files that a Kinesis Firehose delivery stream puts in my S3 bucket to Apache Parquet using an AWS Glue job.
Here is my JSON payload:
{
"device_name": "inHand-RTU",
"Temperature": 27.1,
"Pyranometer": 14,
"Active-Power": 0,
"Voltage-1": 235.58,
"Active-Import": 2.57
}
Later, in the rules engine, I add a timestamp with a query. After that, the payload looks like this:
{
"device_name":"inHand-RTU",
"Temperature":27,
"Pyranometer":15,
"Active-Power":0,
"Voltage-1":236.59,
"Active-Import":2.5699999999999998,
"time":1673517687650
}
When I try to run the job in Glue Studio, it gives me the following error:
Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType#e7b95c9 and DynamicNode: longnode
The files are stored in the S3 bucket with this structure:
<my-bucket-name>/site_name=inHand-rtu/year=2023/month=01/day=12/firehose-ds-to-s3-1-minute-ds-1-2023-01-12-10-01-27-62d4f06e-7fff-3bb5-89dd-c55860a0dbd9
I want my Glue job to convert the files written by the Firehose delivery stream from JSON to Apache Parquet and store them in the same S3 bucket under a folder named "Parquet Files", partitioned like this:
<my-bucket-name>/Parquet Files/site_name=inHand-rtu/year=2023/month=01/day=12/ <parquet files>
Any help will be greatly appreciated.
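The mapping from the Firehose object key to the desired Parquet prefix can be sketched in plain Python. This is only an illustration of the layout described above, not a Glue job: the helper names `parse_partitions` and `parquet_prefix` are hypothetical, and the partition values are kept as strings.

```python
from pathlib import PurePosixPath


def parse_partitions(key: str) -> dict:
    """Extract Hive-style name=value segments from an S3 object key."""
    partitions = {}
    for segment in PurePosixPath(key).parts[:-1]:  # skip the file name
        name, sep, value = segment.partition("=")
        if sep:  # keep only segments that contain '='
            partitions[name] = value
    return partitions


def parquet_prefix(key: str, root: str = "Parquet Files") -> str:
    """Build the target prefix that mirrors the source partitions under `root`."""
    segments = [f"{name}={value}" for name, value in parse_partitions(key).items()]
    return "/".join([root] + segments) + "/"


source_key = (
    "site_name=inHand-rtu/year=2023/month=01/day=12/"
    "firehose-ds-to-s3-1-minute-ds-1-2023-01-12-10-01-27-"
    "62d4f06e-7fff-3bb5-89dd-c55860a0dbd9"
)
print(parse_partitions(source_key))
print(parquet_prefix(source_key))
```

In an actual Glue job the same partition names (site_name, year, month, day) would typically be passed as partition keys when writing the output, so the writer creates this folder layout itself.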

Related

process non csv, json and parquet files from s3 using glue

A little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but when I tried to start a new job from a plain graph, the only S3 source formats offered were CSV, JSON, and Parquet, and my files are none of these. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command that converts those files to JSON, but the command is a tool I need to download to a machine. Is there any way I can do that and then use Glue on the resulting JSON?
Thanks.

Amazon AWS Athena HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split / Not valid Parquet file, parquet files compress to gzip with Athena

I'm trying to build skills on Amazon Athena.
I have already succeeded in querying data in JSON and Apache Parquet format with Athena.
What I'm trying to do now is add compression (gzip) to it.
My JSON Data :
{
"id": 1,
"prenom": "Firstname",
"nom": "Lastname",
"age": 23
}
Then, I transform the JSON into Apache Parquet format with an npm module : https://www.npmjs.com/package/parquetjs
And finally, I compress the parquet file I get in GZIP format and put it in my s3 bucket : test-athena-personnes.
My Athena Table :
CREATE EXTERNAL TABLE IF NOT EXISTS personnes (
id INT,
nom STRING,
prenom STRING,
age INT
)
STORED AS PARQUET
LOCATION 's3://test-athena-personnes/'
tblproperties ("parquet.compress"="GZIP");
Then, to test it, I launch a very simple request: Select * from personnes;
I get the error message :
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://test-athena-personnes/personne1.parquet.gz (offset=0, length=257): Not valid Parquet file: s3://test-athena-personnes/personne1.parquet.gz expected magic number: [80, 65, 82, 49] got: [-75, 1, 0, 0]
Is there anything I didn't understand or that I'm doing wrong? I can query Apache Parquet files without gzip compression, but not with it.
Thank you in advance
A Parquet file consists of two parts[1]:
Data
Metadata
When Athena reads the file, it reads the metadata first and then the actual data. In your case, you compressed the finished Parquet file with gzip, so Athena finds gzip bytes where it expects the Parquet magic number and metadata, and it cannot interpret the file.
The right way to compress a Parquet file is while writing it, so that the data inside the file is compressed and the file structure stays readable. You need to specify the compression codec when generating the file with parquetjs.
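The magic-number check behind that error message can be reproduced with the Python standard library alone. The sketch below uses a stand-in byte string rather than a real Parquet file (an assumption made to keep it self-contained); only the leading and trailing `PAR1` bytes matter here, and `has_parquet_magic` is a hypothetical helper name.

```python
import gzip

PARQUET_MAGIC = b"PAR1"  # the bytes [80, 65, 82, 49] from the error message


def has_parquet_magic(data: bytes) -> bool:
    """A Parquet file starts and ends with the 4-byte magic number."""
    return (
        len(data) >= 8
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


# Stand-in for a 437-byte Parquet file (not real Parquet content).
fake_parquet = PARQUET_MAGIC + b"\x00" * 429 + PARQUET_MAGIC

# Compressing the whole file afterwards, as in the question.
wrapped = gzip.compress(fake_parquet)

# A gzip stream ends with the uncompressed size as a little-endian
# 32-bit integer, so the last four bytes are no longer "PAR1".
tail_signed = [b - 256 if b > 127 else b for b in wrapped[-4:]]

print(has_parquet_magic(fake_parquet))  # True
print(has_parquet_magic(wrapped))       # False
print(tail_signed)                      # [-75, 1, 0, 0]
```

The trailing bytes `[-75, 1, 0, 0]` match the "got" value in the error above, which would be consistent with the original file having been 437 bytes before gzip was applied, though that size is an inference from the error message and not something stated in the question.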

Does Amazon S3 Manifest File Support Parquet Format?

According to this AWS documentation it appears that Amazon S3 does not support parquet format in the manifest file but I find this hard to believe because that's a very common file format that's used and for Athena/Redshift you are supposed to use parquet format from what I understand. Here's another piece of documentation that references the S3 manifest file in relation to Redshift and parquet file format but I'm not too sure what it means exactly https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html.
I'm just trying to create a data set in Amazon QuickSight using some parquet files in one of my S3 buckets. I tried omitting the globalUploadSettings field in my manifest and was able to pull the data in but QuickSight doesn't know what type of file it is so it just displays the information with a bunch of � characters.
Manifest I currently have:
{
"fileLocations": [
{
"URIPrefixes": [
"https://s3.amazonaws.com/myBucket/myFolderWithData/"
]
}
]
}
The Amazon S3 manifest file used by QuickSight does not support the Parquet format, but you can use Athena as the dataset source to read Parquet data.
Importing File Data
You can use files in Amazon S3 or on your local (on-premises) network as data sources. QuickSight supports files in the following formats:
CSV and TSV – Comma-delimited and tab-delimited text files
ELF and CLF – Extended and common log format files
JSON – Flat or semistructured data files
XLSX – Microsoft Excel files
QuickSight supports UTF-8 file encoding, but not UTF-8 (with BOM).
Files in Amazon S3 that have been compressed with zip, or gzip (www.gzip.org), can be imported as-is. If you used another compression program for files in Amazon S3, or if the files are on your local network, remove compression before importing them.
https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html
For a Redshift manifest file that lists Parquet files in S3, you also need to specify the content length of each file.
Link: https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
S3 manifest file example for Parquet format:
{
"entries": [
{"url":"s3://mybucket/unload/manifest_0000_part_00", "meta": { "content_length": 5956875 }},
{"url":"s3://mybucket/unload/unload/manifest_0001_part_00", "meta": { "content_length": 5997091 }}
]
}
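A manifest in that shape can be generated instead of written by hand. The following is a minimal sketch that assumes the object URLs and sizes are already known (for example from an S3 listing); `build_manifest` is a hypothetical helper, and the URLs and sizes are taken from the example above.

```python
import json


def build_manifest(files):
    """Build a manifest dict from (s3_url, size_in_bytes) pairs."""
    return {
        "entries": [
            {"url": url, "meta": {"content_length": size}}
            for url, size in files
        ]
    }


manifest = build_manifest([
    ("s3://mybucket/unload/manifest_0000_part_00", 5956875),
    ("s3://mybucket/unload/unload/manifest_0001_part_00", 5997091),
])
print(json.dumps(manifest, indent=2))
```

Note that this is the Redshift manifest format; it does not make Parquet readable through a QuickSight S3 manifest.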

Athena returns empty results from Firehose > Glue > S3 parquet setup

I have set up a Kinesis Firehose that passes data through Glue, which transforms the JSON to Parquet, compresses it, and stores it in an S3 bucket. The transformation is successful and I can query the output file normally with Apache Drill. However, I cannot get Athena to work. When I preview the table (select * from s3data limit 10), I get the proper column headers but no data.
Steps I have taken:
I already added the newline to my source: JSON.stringify(event) + '\n';
Downloaded the Parquet file and queried it successfully with Apache Drill
Glue puts the Parquet file in YY/MM/DD/HH folders. I have tried moving the Parquet file to the root folder and I get the same empty results.
The end goal is to eventually get the data into QuickSight, so if I'm going about this wrong, let me know.
What am I missing?

Decompress a zip file in AWS Glue

I have a gzip-compressed file in an S3 bucket. The client will upload files to the bucket daily. When uncompressed, the gzip file contains 10 files in CSV format, all with the same schema. I need to uncompress the file and use a Glue crawler to create the schema before running an ETL script on a dev endpoint.
Is Glue capable of decompressing the file and creating a Data Catalog entry? Is there a Glue library that we can use directly in the Python ETL script? Or should I opt for Lambda or another utility, so that as soon as the file is uploaded, the utility decompresses it and provides the result as input to Glue?
I'd appreciate any replies.
Glue can do the decompression, but it wouldn't be optimal: the gzip format is not splittable, which means only one executor will work on the file.
You can try decompressing with Lambda and then invoking a Glue crawler on the new folder.
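A rough sketch of that Lambda approach follows. Several things here are assumptions and not part of the answer: the function is triggered by an S3 upload event, the object is a single gzip stream (an archive holding several CSV files, such as a tar.gz, would need extra handling), the file fits in memory, and the `uncompressed/` prefix and the crawler name `my-crawler` are placeholders.

```python
import gzip
import urllib.parse


def gunzip_bytes(payload: bytes) -> bytes:
    """Decompress a single gzip stream held in memory."""
    return gzip.decompress(payload)


def target_key(source_key: str, prefix: str = "uncompressed/") -> str:
    """Place the decompressed object under `prefix`, dropping a .gz suffix."""
    name = source_key.rsplit("/", 1)[-1]
    if name.endswith(".gz"):
        name = name[:-3]
    return prefix + name


def handler(event, context):
    # boto3 is imported lazily so the helpers above can be used without it.
    import boto3

    s3 = boto3.client("s3")
    glue = boto3.client("glue")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        s3.put_object(Bucket=bucket, Key=target_key(key), Body=gunzip_bytes(body))
    # Hypothetical crawler name; it should point at the uncompressed/ prefix.
    glue.start_crawler(Name="my-crawler")
```

If the trigger covers the whole bucket, it is worth restricting it to the upload prefix or to a .gz suffix so that the decompressed output does not invoke the function again.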
Use glueContext.create_dynamic_frame.from_options and specify the compression type in the connection options. The output can similarly be compressed when writing to S3. The snippet below worked for bzip; change the format to gz|gzip and try.
I tried the Target Location option in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, then adapted the generated code to read a compressed file from S3. This is not directly documented.
I'm not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a DataFrame and back to a DynamicFrame for a 400 MB bzip-compressed CSV file. Note that execution time is different from the start_time and end_time shown in the console.
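from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a bzip-compressed, semicolon-separated CSV file from S3
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    format='csv',
    format_options={'separator': ';'}
)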
I've written a Glue Job that can unzip s3 files and put them back in s3.
Take a look at https://stackoverflow.com/a/74657489/17369563