I'm working on an IIoT-based monitoring solution built on AWS services. I'm batching the data received from IoT Core in this flow:
IoT Core -> IoT rule (to Firehose delivery stream) -> Kinesis Firehose (900-second buffer) -> S3 bucket
The S3 prefix is as follows:
partitionKey=!{partitionKeyFromQuery:device_name}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:hh}
But the issue with this flow is that it produces the following folder structure in the S3 bucket:
partitionKey=wind-mill-01/year=2023/month=01/day=08/hour=08 (the hour is logged in UTC)
I want the "hour" field to be logged in IST. Is there any way to do this?
Any help will be greatly appreciated.
Firehose only supports UTC, as mentioned in the documentation. If you need IST for some reason, you can have a Lambda function (or a Glue job) move each object into another path with the hour in IST.
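For what it's worth, a rough sketch of such a Lambda in Python/boto3 (assuming it is triggered by the bucket's ObjectCreated notification and that keys follow the prefix above; the ist/ destination prefix is just an illustration, not part of your setup):

import boto3
from datetime import datetime, timedelta
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
IST_OFFSET = timedelta(hours=5, minutes=30)  # IST = UTC+05:30

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith("ist/"):
            continue  # don't re-process objects this function already copied
        # key looks like: partitionKey=.../year=2023/month=01/day=08/hour=08/<object>
        parts = dict(p.split("=", 1) for p in key.split("/") if "=" in p)
        utc_hour = datetime(int(parts["year"]), int(parts["month"]),
                            int(parts["day"]), int(parts["hour"]))
        ist_hour = utc_hour + IST_OFFSET
        new_key = (f"ist/{key.split('/')[0]}/year={ist_hour:%Y}/month={ist_hour:%m}/"
                   f"day={ist_hour:%d}/hour={ist_hour:%H}/{key.split('/')[-1]}")
        # copy into the IST-based path, then remove the UTC-based object
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)

Writing under a separate ist/ prefix (plus the guard at the top) keeps the function from re-triggering itself on its own copies; alternatively, scope the S3 event notification to the partitionKey= prefix only.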
I'm trying to store IoT data in an AWS S3 bucket, but each message is very small (around 1-3 KB), so it makes no sense to store such tiny objects individually in S3. I want to use AWS SQS (a FIFO queue) to hold the data for around 15-30 minutes, then send the batched JSON data to the S3 bucket, and also run some calculations on the batched data before it gets stored in S3.
I want to create a flow like this:
AWS IoT Core -> IoT Core rule (to SQS) -> AWS SQS (retain data for 15/30 minutes) -> S3 bucket
The payload looks like this:
{
  "sensor_name": "XXXX",
  "temp": 33.45,
  "humidity": 0.20,
  "timestamp": epochtime (added by the AWS IoT rule query)
}
After batching, the data should be stored in S3 under a prefix like sensor_name/yyyy/mm/dd so that I can query it with relative ease using AWS Athena and generate CSVs as and when required.
I tried AWS Kinesis, but it is a bit out of budget for me at the moment.
Any help would be greatly appreciated.
You can create an EventBridge rule that triggers a Lambda function every 15/30 minutes; the function reads all the messages currently in the queue, processes the data, and stores it in S3 in an aggregated format.
I used this approach to implement an image capture app: each frame is published as an SQS message and every hour a Lambda function reads the frames in the queue and creates a time-lapse video which is then stored in S3.
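For the sensor payload in the question, a rough sketch of that scheduled Lambda could look like this (Python/boto3; the queue URL, bucket name, and per-sensor grouping are assumptions):

import json
import boto3
from collections import defaultdict
from datetime import datetime, timezone

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.ap-south-1.amazonaws.com/123456789012/iot-batch.fifo"  # placeholder
BUCKET = "my-iot-batches"  # placeholder

def lambda_handler(event, context):
    batches = defaultdict(list)
    handles = []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10, WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])
            batches[body["sensor_name"]].append(body)
            handles.append({"Id": msg["MessageId"],
                            "ReceiptHandle": msg["ReceiptHandle"]})
    now = datetime.now(timezone.utc)
    for sensor, readings in batches.items():
        # any calculations/aggregations on the batch would go here
        key = f"{sensor}/{now:%Y}/{now:%m}/{now:%d}/{now:%H%M%S}.json"
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=json.dumps(readings).encode("utf-8"))
    # delete the processed messages in chunks of 10 (the SQS batch limit)
    for i in range(0, len(handles), 10):
        sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=handles[i:i + 10])

The EventBridge schedule (rate(15 minutes) or rate(30 minutes)) simply has this function as its target.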
I'm currently brainstorming an idea and trying to figure out what the missing pieces are, or whether there is a better way to solve this problem.
Assume I have a product that customers can embed on their websites. My end goal is to build a dashboard on my website showing relevant analytics (such as page loads, clicks, and custom events) to my customers.
I have separated this feature into 2 parts:
collection of data
We can collect data from 2 sources:
Embed of https://my-bucket/$customerId/product.json
CloudFront Logs -> S3 -> Kinesis Data Streams -> Kinesis Data Firehose -> S3
HTTP request POST /collect to collect an event
API Gateway endpoint -> Lambda -> Kinesis Data Firehose -> S3 (a sketch of this path is shown at the end of this question)
access of data
My dashboard will be calling GET /analytics?event=click&from=...&to=...&productId=...
The first part is straightforward:
API Gateway route -> Lambda
The part I'm struggling with: how can my Lambda access the data that is, at that moment, stored in S3?
So far, I have evaluated these options:
S3 + Glue -> Athena: Athena is not a high-availability service. To my understanding, some queries can take minutes to execute. I need something fast and snappy.
Kinesis Data Firehose -> DynamoDB: It is difficult to filter and sort in DynamoDB. I'm afraid that the high volume of analytics will slow it down and make it impractical.
QuickSight: It doesn't expose an SQL way to get data
Kinesis Analytics: It doesn't expose an SQL way to get data
Amazon OpenSearch Service: Feels overkill (?)
Redshift: Looking into it next
I'm most probably misnaming what I'm trying to do, as I can't seem to find any relevant help for a problem that I would think must be quite common.
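For reference, a hedged sketch of the Lambda behind POST /collect in the collection part above (Python/boto3 behind an API Gateway proxy integration; the delivery stream name and payload fields are placeholders):

import json
import time
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "analytics-events"  # placeholder Firehose delivery stream

def lambda_handler(event, context):
    # API Gateway proxy integration passes the raw POST body as a string
    payload = json.loads(event.get("body") or "{}")
    record = {
        "productId": payload.get("productId"),
        "event": payload.get("event"),              # e.g. "click", "page_load"
        "properties": payload.get("properties", {}),
        "receivedAt": int(time.time()),
    }
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
    return {"statusCode": 202, "body": ""}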
Does anyone know of any AWS service other than Kinesis Firehose that can catch the S3 object-created event? I am trying to do some analysis on VPC Flow Logs; my current setup is CloudWatch Logs -> Kinesis Firehose -> S3 -> Athena.
The problem is that Kinesis Firehose can only buffer up to 128 MB, which is too small for me.
Events from Amazon S3 can go to:
AWS Lambda functions
Amazon SNS topic
Amazon SQS queue
So, you could send the messages to an SQS queue and then have a regular process (every hour?) that retrieves many messages and writes them to a single file.
Alternatively, you could keep your current setup but use Amazon Athena on a regular basis to join multiple files by using CREATE TABLE AS. This selects from the existing files and stores the output in a new table (with a new location). You could even use it to transform the files into a format that is easier to query in Athena (e.g. Snappy-compressed Parquet). The hard part is making sure each input file is included only once in this concatenation process (possibly using SymlinkTextInputFormat).
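A hedged sketch of kicking off such a CTAS consolidation from code (Python/boto3; database, table, partition column, and bucket names are all placeholders):

import boto3

athena = boto3.client("athena")

# Consolidate many small raw files into Snappy-compressed Parquet at a new location.
CTAS = """
CREATE TABLE flow_logs_compacted_2023_01_08
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-analytics-bucket/compacted/dt=2023-01-08/'
) AS
SELECT *
FROM flow_logs_raw
WHERE dt = '2023-01-08'  -- limit each run to one partition so files are read only once
"""

response = athena.start_query_execution(
    QueryString=CTAS,
    QueryExecutionContext={"Database": "vpc_logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print("Started CTAS:", response["QueryExecutionId"])

Running one CTAS per partition of the raw table (here a hypothetical dt column) is one way to make sure each input file is concatenated exactly once.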
I want to move (export) data from DynamoDB to S3.
I have seen this tutorial, but I'm not sure whether the extracted data will be deleted from DynamoDB or will coexist in both DynamoDB and S3 at the same time.
What I expect is that the data will be deleted from DynamoDB and stored in S3 (after being kept in DynamoDB for X time).
The main purpose of the project could be similar to this
Is there any way to do this without having to develop a Lambda function?
To summarize, I have found these 2 different ways:
DynamoDB -> Data Pipeline -> S3 (is the DynamoDB data deleted?)
DynamoDB -> DynamoDB TTL + DynamoDB Streams -> Lambda -> Firehose -> S3 (this appears to be more difficult)
Is this post still valid for this purpose?
What would be the simplest and fastest way?
In your first option, by default, data is not removed from DynamoDB. You can design a pipeline to make this work, but I don't think that is the best solution.
In your second option, you must evaluate the solution based on your expected data volume:
If the data volume that will expire under the TTL definition is not very large, you can use Lambda to persist the removed data into S3 without Firehose. You can design a simple Lambda function that is triggered by the DynamoDB Stream and persists each stream event as an S3 object (a sketch of this function follows below). You can even trigger another Lambda function to consolidate the objects into a single file at the end of the day, week, or month, but again, base this on your expected volume.
If a lot of data expires at the same time and you must perform transformations on it, the best solution is to use Firehose. Firehose can transform, encrypt, and compress your data before sending it to S3. If the volume of data is too big, running consolidation functions at the end of the day, week, or month may not be feasible, so it is better to perform all of these steps before persisting the data.
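A minimal sketch of that stream-triggered function (Python/boto3; the bucket and key layout are assumptions, and the stream must be configured with OLD_IMAGE or NEW_AND_OLD_IMAGES so the expired item is still available):

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dynamodb-archive"  # placeholder

def lambda_handler(event, context):
    for record in event["Records"]:
        # only archive items removed by the TTL process, not manual deletes:
        # TTL deletions are REMOVE events whose userIdentity principal is
        # dynamodb.amazonaws.com
        if record["eventName"] != "REMOVE":
            continue
        if record.get("userIdentity", {}).get("principalId") != "dynamodb.amazonaws.com":
            continue
        old_image = record["dynamodb"]["OldImage"]  # the item in DynamoDB JSON form
        key = f"expired/{record['eventID']}.json"
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=json.dumps(old_image).encode("utf-8"))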
You can use AWS Data Pipeline to dump a DynamoDB table to S3, and the source data will not be deleted.
I'm sending my logs to Kinesis Firehose using the API (C#). Firehose delivers them to Elasticsearch, and I'm analyzing the data using Kibana.
For some reason, I see the records in Kibana with a delay of a few minutes (2-5).
I don't know which step introduces the delay.
Why is it taking so long for the data to appear?
Most probably it is Kibana. I have tried it in the past and was not much impressed. Can you enable AWS X-Ray? That can give you visibility into which component is taking so long.
I needed to change the 'ES buffer interval (sec)*' in the stream details from 300 to 60.
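If you prefer to make that change through the API rather than the console, the same buffering hint can be updated along these lines (Python/boto3; the delivery stream name is a placeholder):

import boto3

firehose = boto3.client("firehose")
STREAM = "my-log-stream"  # placeholder delivery stream name

desc = firehose.describe_delivery_stream(DeliveryStreamName=STREAM)
description = desc["DeliveryStreamDescription"]

firehose.update_destination(
    DeliveryStreamName=STREAM,
    CurrentDeliveryStreamVersionId=description["VersionId"],
    DestinationId=description["Destinations"][0]["DestinationId"],
    ElasticsearchDestinationUpdate={
        # deliver to Elasticsearch every 60 seconds instead of the 300-second default
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5}
    },
)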