Best way to convert JSON to Apache Parquet format using AWS

I've been working on a project where I store IoT data in an S3 bucket, batching it with AWS Kinesis Firehose. I have a Lambda function running on the delivery stream that converts the epoch-millisecond time field into a proper timestamp with date and time. Here is my sample JSON payload:
{
  "device_name": "inHand-RTU",
  "Temperature": 22.3,
  "Pyranometer": 6,
  "Active-Power": 0,
  "Voltage-1": 233.93,
  "Active-Import": 2.57,
  "time": "17-01-2023 10:49:09"
}
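For reference, a minimal sketch of the kind of transform Lambda described above (the handler below is illustrative, not the actual function; it assumes Firehose's standard transformation record format and an incoming epoch-millisecond time field):

import base64
import json
from datetime import datetime, timezone

def lambda_handler(event, context):
    # Firehose transform: replace the epoch-millisecond "time" field with a
    # formatted "DD-MM-YYYY HH:MM:SS" timestamp, matching the payload above.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        epoch_ms = payload["time"]
        payload["time"] = datetime.fromtimestamp(
            epoch_ms / 1000, tz=timezone.utc
        ).strftime("%d-%m-%Y %H:%M:%S")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}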
I now want to convert these files in S3 to Parquet and then process them with Apache PySpark.
What is the best way to do so? Should I use Kinesis Firehose itself, which provides built-in functionality to convert the data to Parquet, or should I go with AWS Glue jobs? Both services appear to do the same thing, so what is the difference between them?
Which approach should I follow?
Any help will be greatly appreciated.

The best way is to use the native Parquet conversion built into Firehose.
Firehose has an option (Convert record format - enable it) to convert records to Parquet or ORC format before delivering them to S3:
https://docs.aws.amazon.com/firehose/latest/dev/create-transform.html
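As a rough sketch, enabling that conversion when creating the delivery stream with boto3 could look like the following; the stream name, role ARN, bucket, and Glue database/table names are placeholders, and the schema must already exist as a Glue Data Catalog table:

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="iot-to-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-iot-bucket",
        # Format conversion requires a buffer size of at least 64 MB.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Parse incoming JSON records...
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # ...and write them out as Parquet.
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # Column names and types come from a Glue Data Catalog table.
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "iot_db",
                "TableName": "rtu_readings",
                "Region": "us-east-1",
            },
        },
    },
)

Once Firehose is writing Parquet, the files can be read directly in PySpark with spark.read.parquet("s3://my-iot-bucket/"), so no separate Glue conversion job is needed for this step.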

Related

Scrapy: write output into Amazon Kinesis Data Firehose

Instead of exporting my output (which is a .json file) into an S3 bucket, I would like to export it into Amazon Kinesis Data Firehose.
Is it possible to do this?
Where should I write the functions to handle this? I'm planning to use boto3
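One possible sketch, assuming a delivery stream named scrapy-output-stream (a placeholder) and boto3 credentials available to the spider, is a custom item pipeline:

import json
import boto3

class FirehoseExportPipeline:
    # Scrapy item pipeline that pushes each scraped item to a Kinesis Data
    # Firehose delivery stream. Stream name and region are placeholders.

    def open_spider(self, spider):
        self.client = boto3.client("firehose", region_name="us-east-1")

    def process_item(self, item, spider):
        self.client.put_record(
            DeliveryStreamName="scrapy-output-stream",
            # Firehose does not add delimiters, so append a newline per record.
            Record={"Data": (json.dumps(dict(item)) + "\n").encode("utf-8")},
        )
        return item

The pipeline would then be enabled through ITEM_PIPELINES in settings.py.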

AWS GLUE transform for binary files in S3 from Protobuf (Google Protocol Buffers) for AWS Athena

Firstly, I am a bit new to this, so apologies if my terms are not correct.
What we are doing
We have files already in S3 in a binary format (e.g. Google Protocol Buffers) on which we would like to run an ETL job to create a data lake of transformed data that will be accessed using either Amazon Redshift or Amazon Athena. In the future we may stream via Kinesis.
Issue we face
We are looking at using AWS Glue, but its list of supported formats is limited (CSV, JSON, Parquet, ORC, Avro, Grok) and the docs don't offer a 'Custom/other' option: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
Thoughts
Is there a cost-effective way to pre-transform the data in S3 within the Glue job into a Parquet input? (see the sketch after this question)
Is there a way to extend AWS Glue with our custom binary format?
Maybe we are using the wrong AWS tool?
Key Considerations
Costs, i.e. whether we would have to duplicate all the data in S3 for Glue to work on it, versus somehow streaming an in-memory transform.
Later we hope to stream the data using Kinesis.
Any help or experience you may have is greatly appreciated, especially examples or existing use-cases as I don't feel what we are trying to do is out of the ordinary... or is it?
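As one possible sketch of the pre-transform idea above: a Spark job (which could run as a Glue Spark job or on EMR) can read the binary objects directly and deserialize them with protoc-generated Python classes. The sensor_pb2.SensorReading class, its fields, and the bucket names below are all hypothetical:

from pyspark.sql import SparkSession, Row

# Hypothetical class generated with `protoc --python_out`; not an AWS API.
from sensor_pb2 import SensorReading

spark = SparkSession.builder.appName("protobuf-to-parquet").getOrCreate()

def decode(record):
    # Deserialize one protobuf blob into a flat row of columns.
    _path, blob = record
    msg = SensorReading()
    msg.ParseFromString(blob)
    return Row(device=msg.device, value=msg.value, timestamp=msg.timestamp)

# binaryFiles yields (path, bytes) pairs, so Spark never needs a built-in
# protobuf reader; the custom format is handled entirely in decode().
rows = spark.sparkContext.binaryFiles("s3://source-bucket/raw/").map(decode)
spark.createDataFrame(rows).write.mode("overwrite").parquet("s3://datalake-bucket/curated/")

The generated sensor_pb2 module has to be shipped to the executors (e.g. via --py-files on EMR or Glue's --extra-py-files job parameter).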

Converting CloudFront access logs to Parquet format

What is the best way to convert CloudFront access logs to Parquet format and write them back to S3?
Currently I know of two common ways:
Trigger a Lambda on the original log write to S3 and send it to AWS Kinesis Firehose for conversion
Using CTAS periodically to convert an entire table (a sketch of this appears below)
Which option should I use and what are the main differences between them?
For ease of use: a Glue crawler.
The next thing in line I can think of is a Glue job; the more controlled approach would then be an EMR Spark cluster job for faster processing.
ECR/Kubernetes also makes a good case.
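For the CTAS option mentioned in the question, a minimal sketch of a scheduled conversion driven from boto3 (database, table, and bucket names are placeholders) could be:

import boto3

athena = boto3.client("athena")

# CTAS query: rewrite the raw log table as Parquet at a new S3 location.
ctas = """
CREATE TABLE cf_logs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-logs-bucket/parquet/'
) AS
SELECT * FROM cf_logs_raw
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)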

Cannot Archive Data from AWS Kinesis to Glacier

I am working on a data processing application hosted as a web service on EC2; each second, a small data file (less than 10 KB) in .csv format is generated.
Problem Statement:
Archive all the data files generated to Amazon Glacier.
My Approach:
As the data files are very small, I store them in AWS Kinesis and, after a few hours, flush the data to S3 (because I cannot find a direct way to put data from Kinesis into Glacier). Using S3 lifecycle management, I then archive all the objects to Glacier at the end of the day.
My Questions :
Is there a way to transfer data to Glacier directly from Kinesis?
Is it possible to configure Kinesis to flush data to S3/Glacier at the end of the day? Is there any time or memory limit up to which Kinesis can hold data?
If Kinesis cannot transfer data to Glacier directly, is there a workaround, such as writing a Lambda function that fetches data from Kinesis and archives it to Glacier?
Is it possible to merge all the .csv files at the Kinesis, S3 or Glacier level?
Is Kinesis suitable for my use case? Is there anything else I can use?
I would be grateful if someone could take the time to answer my questions and point me to some references. Please let me know if there is a flaw in my approach or if there is a better way to do so.
Thanks.
You can't directly put data from Kinesis into Glacier (unless you want to put the 10 KB files directly into Glacier).
You could consider Kinesis Data Firehose as a way of flushing 15-minute increments of data to S3.
You can definitely do that. Glacier allows direct uploads, so there's no need to upload to S3 first.
You could use Firehose to flush to S3, then transform and aggregate using Athena, then transition that file to Glacier. Or you could use Lambda directly and upload straight to Glacier.
Perhaps streaming data into Firehose would make more sense. Depending on your exact needs, IoT Analytics might also be interesting.
Reading your question again and seeing that you use .csv files, I would highly recommend the Kinesis > S3 > Athena > transition-to-Glacier approach.
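A minimal sketch of the Lambda-straight-to-Glacier option mentioned above, using boto3 (the vault name and the shape of the incoming event are placeholders):

import boto3

glacier = boto3.client("glacier")

def lambda_handler(event, context):
    # Upload one CSV payload straight to a Glacier vault, skipping S3.
    body = event["csv_body"].encode("utf-8")
    response = glacier.upload_archive(
        vaultName="iot-archive-vault",
        archiveDescription="per-second CSV snapshot",
        body=body,
    )
    # The archive ID is the only handle for later retrieval, so persist it
    # somewhere queryable (e.g. DynamoDB) rather than discarding it.
    return {"archiveId": response["archiveId"]}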

Can I convert CSV files sitting on Amazon S3 to Parquet format using Athena and without using Amazon EMR

I would like to convert the CSV data files that are currently sitting on Amazon S3 into Parquet format using Amazon Athena and push them back to Amazon S3, without any help from Amazon EMR. Is it possible to do this? Has anyone experienced something similar?
Amazon Athena can query data but cannot convert data formats.
You can use Amazon EMR to Convert to Columnar Formats. The steps are:
Create an external table pointing to the source data
Create a destination external table with STORED AS PARQUET
INSERT OVERWRITE TABLE <destination_table> SELECT * FROM <source_table>
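As an alternative sketch of the same conversion (not the Hive steps listed above), expressed as a PySpark job that could run as an EMR step, with placeholder bucket paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# inferSchema keeps the sketch short; declaring a fixed schema is safer in practice.
(spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("s3://source-bucket/csv/")
      .write
      .mode("overwrite")
      .parquet("s3://source-bucket/parquet/"))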