What is the easiest way to automatically ingest CSV data from an S3 bucket into a Timestream database?
I have an S3 bucket which is continuously generating CSV files inside a folder structure. I want to save these files into a Timestream database so I can visualize them in my Grafana instance.
I already tried to do that via a Glue crawler, but that won't work for me. Is there any workaround or tutorial on how to solve this task?
I do this using a Lambda function, an SNS topic and an SQS queue.
New files in my bucket trigger a notification on an SNS topic.
The notification gets added to an SQS queue.
The Lambda function consumes the queue, retrieves the bucket and key of the new S3 object, downloads the CSV file, does some processing and ingests the data into Timestream. The Lambda is implemented in Python.
This has been working OK, with the caveat that large files may not ingest fully within the Lambda 15-minute limit; Timestream ingestion is not super fast. It gets better by using multi-measure records, as well as the "common attributes" feature of the Timestream client in boto3.
(It should be noted that the Lambda can be triggered directly by the S3 bucket, if one prefers. Using a queue allows a bit more flexibility, such as being able to manually add files to the queue for reprocessing.)
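A minimal sketch of such a Lambda, assuming the SQS messages carry standard S3 event notifications, a hypothetical CSV layout of `timestamp,sensor,value`, and a Timestream database and table that already exist; it also shows the `CommonAttributes` parameter mentioned above:

```python
import csv
import io
import json

import boto3

s3 = boto3.client("s3")
timestream = boto3.client("timestream-write")

DATABASE = "my_database"   # assumed database name
TABLE = "my_table"         # assumed table name


def handler(event, context):
    for sqs_record in event["Records"]:
        # Each SQS message body is an S3 event notification
        s3_event = json.loads(sqs_record["body"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

            # Attributes shared by every record go into CommonAttributes
            # so they are not repeated per record
            common = {"MeasureValueType": "DOUBLE", "TimeUnit": "MILLISECONDS"}

            records = []
            for row in csv.DictReader(io.StringIO(body)):
                records.append({
                    "Dimensions": [{"Name": "sensor", "Value": row["sensor"]}],
                    "MeasureName": "value",
                    "MeasureValue": row["value"],
                    "Time": row["timestamp"],   # epoch milliseconds as a string
                })
                # WriteRecords accepts at most 100 records per call
                if len(records) == 100:
                    timestream.write_records(DatabaseName=DATABASE, TableName=TABLE,
                                             CommonAttributes=common, Records=records)
                    records = []
            if records:
                timestream.write_records(DatabaseName=DATABASE, TableName=TABLE,
                                         CommonAttributes=common, Records=records)
```

Batching the writes and moving shared attributes into CommonAttributes is what helps keep the run time of larger files within the 15-minute limit.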
Related
I have live streaming data coming my way from different sources. We have created an AWS SQS queue to receive this data. I was directly pushing the incoming data to DynamoDB using a Lambda function, but I have been asked to store the raw files in an S3 bucket first, so that we do not have any data loss and can later use that data for further transformations.
I have an AWS SQS queue and I want to store the messages from this queue in an S3 bucket in JSON or Parquet form. Or is there any alternative other than S3 for storing the raw data in AWS?
I tried searching the web but couldn't find anything concrete. I am new to this, so please help me solve it.
I tried searching the web but couldn't find anything concrete.
That's correct, because there is no such AWS-provided feature or tool. You have to implement a custom solution for that yourself.
Since you are using a Lambda that reads from the queue, you can add code to that Lambda function to write the data to S3 before writing it to DynamoDB.
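A minimal sketch of that ordering, assuming the message body is JSON and using hypothetical bucket and table names:

```python
import json

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-table")   # assumed table name

RAW_BUCKET = "my-raw-data-bucket"                      # assumed bucket name


def handler(event, context):
    for record in event["Records"]:
        body = record["body"]

        # 1. Archive the raw message in S3 first, so nothing is lost
        key = f"raw/{record['messageId']}.json"
        s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=body.encode("utf-8"))

        # 2. Then write the parsed item to DynamoDB as before
        table.put_item(Item=json.loads(body))
```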
Does anyone know whether, other than Kinesis Firehose, there is any other AWS service that can catch the S3 object-created event? I am trying to do some analysis on VPC Flow Logs; the current setup is CloudWatch Logs -> Kinesis Firehose -> S3 -> Athena.
The problem is that Kinesis Firehose can only buffer up to 128 MB, which is too small for me.
Events from Amazon S3 can go to:
AWS Lambda functions
Amazon SNS topic
Amazon SQS queue
So, you could send the messages to an SQS queue and then have a regular process (every hour?) that retrieves many messages and writes them to a single file.
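A minimal sketch of such a batching job (a scheduled Lambda or cron script), with hypothetical queue URL and bucket names:

```python
import datetime

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # assumed
OUTPUT_BUCKET = "my-combined-output"                                     # assumed


def handler(event, context):
    batch = []  # (message body, receipt handle)
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=1, VisibilityTimeout=300)
        messages = resp.get("Messages", [])
        if not messages:
            break
        batch.extend((m["Body"], m["ReceiptHandle"]) for m in messages)

    if batch:
        # One combined object per run instead of many small files
        key = "combined/" + datetime.datetime.utcnow().strftime("%Y/%m/%d/%H%M%S") + ".json"
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=key,
                      Body="\n".join(body for body, _ in batch).encode("utf-8"))

        # Only delete messages once the combined object has been written
        for _, handle in batch:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=handle)
```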
Alternatively, you could keep your current setup but use Amazon Athena on a regular basis to join multiple files by using CREATE TABLE AS. This would select from the existing files and store the output in a new table (with a new location). You could even use it to transform the files into a format that is easier to query in Athena (e.g. Snappy-compressed Parquet). The hard part is to include each input file only once in this concatenation process (possibly using SymlinkTextInputFormat).
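And a sketch of kicking off such a CTAS compaction from a scheduled job via boto3; the database, table names and S3 locations below are assumptions:

```python
import boto3

athena = boto3.client("athena")

# Compact the small source files into Snappy-compressed Parquet at a new location.
CTAS = """
CREATE TABLE vpc_flow_logs_compacted
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-analytics-bucket/flow-logs-compacted/'
) AS
SELECT * FROM vpc_flow_logs_raw
"""

athena.start_query_execution(
    QueryString=CTAS,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
```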
I want to periodically insert data from S3 (or other sources) into Amazon Redshift, i.e., when data is added to my S3 bucket, I want an option to add it automatically to my Amazon Redshift cluster.
My preferred method for doing this is to establish a trigger that fires every time a file is created in a particular part of a bucket. This trigger creates an event that initiates a Lambda function, which issues the desired SQL to Redshift. (Or, if the work needed in Redshift is complex or long-running, I will use a Step Function, but this is rare.)
Example setups for this:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
https://64lines.medium.com/building-a-aws-lambda-function-to-run-aws-redshift-sql-scripts-in-python-7468b7c2fdea
I'd start simple if you can and work up to the Redshift Data API and Step Functions.
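A minimal sketch of the event-driven path using the Redshift Data API from an S3-triggered Lambda; the cluster, database, user, IAM role and table names are assumptions:

```python
import boto3

redshift_data = boto3.client("redshift-data")

CLUSTER_ID = "my-cluster"                                    # assumed
DATABASE = "analytics"                                       # assumed
DB_USER = "loader"                                           # assumed
COPY_ROLE = "arn:aws:iam::123456789012:role/redshift-copy"   # assumed


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Issue a COPY for the newly created object; the Data API runs it
        # asynchronously, so the Lambda does not wait for the load to finish.
        sql = (
            f"COPY my_table FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{COPY_ROLE}' FORMAT AS CSV IGNOREHEADER 1;"
        )
        redshift_data.execute_statement(
            ClusterIdentifier=CLUSTER_ID,
            Database=DATABASE,
            DbUser=DB_USER,
            Sql=sql,
        )
```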
You can automate the insertion of data from S3 with a scheduled Lambda that triggers periodically. This might be a better solution than invoking a Lambda on every object upload, especially if you are receiving lots of files continuously.
I have gone through a couple of Stack Overflow questions regarding hourly backups from DynamoDB to S3, where the best solution turned out to be enabling a DynamoDB Stream, subscribing a Lambda function to it, and pushing to S3.
I am trying to understand whether pushing directly from Lambda to S3 is fine, or whether it is better to go from Lambda to Kinesis Firehose and then to S3. Can someone share what the advantage is of introducing Firehose in between? We anyway trigger the Lambda only after a specific batch window, which implies we are already buffering there.
Thanks in advance.
Firehose gives you the possibility to convert and compress your data. In addition, you can directly attach a Glue metadata table, so you can query your data with Athena.
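For the Lambda-to-Firehose leg, a minimal sketch of a DynamoDB Streams handler that simply forwards records to a delivery stream (the stream name is an assumption; Firehose then takes care of buffering, conversion and delivery to S3):

```python
import json

import boto3

firehose = boto3.client("firehose")

DELIVERY_STREAM = "ddb-backup-stream"   # assumed delivery stream name


def handler(event, context):
    # Each invocation receives a batch of DynamoDB stream records
    records = [
        {"Data": (json.dumps(r["dynamodb"]) + "\n").encode("utf-8")}
        for r in event["Records"]
    ]
    # PutRecordBatch accepts up to 500 records per call
    for i in range(0, len(records), 500):
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records[i:i + 500],
        )
```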
You can write a Lambda function that reads a DynamoDB table, gets a result set, encodes the data into some format (e.g. JSON), and then places that JSON into an Amazon S3 bucket. You can use scheduled events to fire off the Lambda function on a regular schedule.
Here is an AWS tutorial that shows you how to use scheduled events to invoke a Lambda function:
Creating scheduled events to invoke Lambda functions
This AWS tutorial also shows you how to read data from an Amazon DynamoDB table from a Lambda function.
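A minimal sketch of that scan-and-export flow, with assumed table and bucket names (a large table would need a more careful approach, e.g. streams or DynamoDB's export feature):

```python
import datetime
import json

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

TABLE_NAME = "my-table"          # assumed
BACKUP_BUCKET = "my-backups"     # assumed


def handler(event, context):
    table = dynamodb.Table(TABLE_NAME)

    # Scan the whole table, following pagination
    items = []
    resp = table.scan()
    items.extend(resp["Items"])
    while "LastEvaluatedKey" in resp:
        resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
        items.extend(resp["Items"])

    key = "backups/" + datetime.datetime.utcnow().strftime("%Y-%m-%dT%H%M%S") + ".json"
    s3.put_object(
        Bucket=BACKUP_BUCKET,
        Key=key,
        Body=json.dumps(items, default=str).encode("utf-8"),  # default=str handles Decimal
    )
```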
There are examples and documentation on copying data from Kafka topics to S3, but how do you copy data from S3 to Kafka?
When you read an S3 object, you get a byte stream. And you can send any byte array to Kafka with ByteArraySerializer.
Or you can parse that InputStream into some custom object, then send it using whatever serializer you can configure.
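A minimal sketch of the byte-stream approach in Python, assuming boto3 plus the kafka-python package; bucket, key, topic and broker addresses are placeholders:

```python
import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers=["broker1:9092"])  # assumed broker address

BUCKET = "my-source-bucket"   # assumed
KEY = "data/example.json"     # assumed
TOPIC = "s3-ingest"           # assumed

# Read the S3 object as raw bytes and publish it; kafka-python sends bytes
# as-is, the equivalent of using ByteArraySerializer in the Java client.
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
producer.send(TOPIC, value=body)
producer.flush()
```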
You can find one example of a Kafka Connect process here (which I assume you are comparing to Confluent's S3 Connect writer): https://jobs.zalando.com/tech/blog/backing-up-kafka-zookeeper/index.html. It can be configured to read binary archives or line-delimited text from S3.
Similarly, Apache Spark, Flink, Beam, NiFi and similar Hadoop-related tools can read from S3 and write events to Kafka as well.
The problem with this approach is that you need to keep track of which files have been read so far, as well as handle partially read files.
Depending on your scenario, or the desired frequency for uploading the objects, you can either use a Lambda function on each event (e.g. every time a file is uploaded) or run it as a cron. This Lambda works as a producer, using the Kafka API to publish to a topic.
Specifics:
The trigger for the Lambda function can be the s3:PutObject event coming directly from S3, or from CloudWatch Events.
You can run the Lambda as a cron if you don't need the objects instantly. The alternative in this case could also be running a cron on an EC2 instance that has a Kafka producer and permissions to read objects from S3, and keeps pushing them to the Kafka topics.
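A minimal sketch of the event-driven variant, parsing the S3 notification in the Lambda and publishing the object bytes to a topic (broker, topic and the kafka-python dependency are assumptions, as above):

```python
import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers=["broker1:9092"])  # assumed broker address

TOPIC = "s3-ingest"   # assumed topic name


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Publish the raw object bytes, keyed by the S3 object key so
        # downstream consumers can tell which file a message came from.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        producer.send(TOPIC, key=key.encode("utf-8"), value=body)

    producer.flush()
```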