How can we efficiently push data from csv file to dynamodb without using aws pipeline? - amazon-web-services

Considering that AWS Data Pipeline is not available in the Singapore region, are there any alternatives for efficiently pushing CSV data to DynamoDB?

If it were me, I would set up an S3 event notification on a bucket that fires a Lambda function each time a CSV file is dropped into it.
The notification lets Lambda know that a new file is available, and the Lambda function is responsible for loading the data into DynamoDB.
This works best (because of Lambda's limits) if the CSV files are not huge, so they can be processed in a reasonable amount of time. The bonus is that once it is working, the only work left is to drop new files into the right bucket - no server required.
Here is a GitHub repository that has a CSV->DynamoDB loader written in Java - it might help get you started.
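For reference, a minimal Python sketch of such a loader Lambda might look like the following (the table name is a placeholder, and it assumes the CSV header row matches the table's attribute names):

import csv
import boto3

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-target-table')  # assumed table name

def handler(event, context):
    # the S3 event notification carries the bucket and key of the new CSV file
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body']
        rows = csv.DictReader(body.read().decode('utf-8').splitlines())
        # batch_writer takes care of batching and retrying BatchWriteItem calls
        with table.batch_writer() as writer:
            for row in rows:
                writer.put_item(Item=row)

As noted above, Lambda's timeout and memory limits mean this is best suited to files of modest size.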

Related

Stream incremental file uploads off of GCS into Kafka

I'm working with a pipeline that pushes JSON entries in batches to my Gcloud Storage bucket. I want to get this data into Kafka.
The way I'm going about it now is a Lambda function that gets triggered every minute to find the files that have changed, open streams from them, read them line by line, and every so often batch those lines as messages into a Kafka producer.
This process is pretty terrible, but it works.... eventually.
I was hoping there'd be a way to do this w/ Kafka Connect or Flink, but there really isn't much development around sensing incremental file additions to a bucket.
Do the JSON entries end up in different files in your bucket? Flink has support for streaming in new files from a source.
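If the entries do land as separate files, a rough PyFlink sketch of that continuous file source might look like the following. The bucket path, broker address, topic, and even the exact module paths are assumptions tied to recent Flink releases, and reading gs:// paths requires the GCS filesystem plugin to be installed on the cluster:

from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.file_system import FileSource, StreamFormat
from pyflink.datastream.connectors.kafka import KafkaSink, KafkaRecordSerializationSchema

env = StreamExecutionEnvironment.get_execution_environment()

# continuously watch the bucket prefix and pick up newly added files
source = FileSource \
    .for_record_stream_format(StreamFormat.text_line_format(), 'gs://my-bucket/entries/') \
    .monitor_continuously(Duration.of_seconds(60)) \
    .build()

lines = env.from_source(source, WatermarkStrategy.no_watermarks(), 'gcs-json-files')

# forward each line as a Kafka message
sink = KafkaSink.builder() \
    .set_bootstrap_servers('broker:9092') \
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
            .set_topic('json-entries')
            .set_value_serialization_schema(SimpleStringSchema())
            .build()) \
    .build()

lines.sink_to(sink)
env.execute('gcs-to-kafka')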

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
- I am considering an S3 trigger, but my concern is that a single file upload would trigger the Lambda before the other files are available, resulting in an error. An alternative would be adding a wait time, but that is not preferred because it wastes compute resources.
- A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.
- An SNS trigger, but I am not certain the message can be automated without knowing when the file uploads are complete.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you can either identify the "last part", e.g. based on some metadata, or you will need to store and track the state of all uploads, e.g. in a DynamoDB table, and do the actual processing only when a batch is complete.
Best, Stefan
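A rough Python sketch of that state-tracking variant (the tracking table, the expected number of parts, and how the batch id is derived from the object key are all assumptions):

import boto3

dynamodb = boto3.resource('dynamodb')
state = dynamodb.Table('upload-state')  # assumed tracking table
EXPECTED_PARTS = 3                      # assumed number of files per batch

def handler(event, context):
    for record in event['Records']:
        key = record['s3']['object']['key']
        batch_id = key.rsplit('_', 1)[0]  # assumed grouping, e.g. "run42" from "run42_part1.csv"
        # atomically count this upload against its batch
        resp = state.update_item(
            Key={'batch_id': batch_id},
            UpdateExpression='ADD parts :one',
            ExpressionAttributeValues={':one': 1},
            ReturnValues='UPDATED_NEW',
        )
        if resp['Attributes']['parts'] >= EXPECTED_PARTS:
            process_batch(batch_id)  # your concatenation / calculation logic

def process_batch(batch_id):
    ...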
Your file, arriving in parts, might be named like this -
filename_part1.ext
filename_part2.ext
If one of your systems generates those files, then have that system also generate a final dummy blank file named -
filename.final
Since your S3 event trigger can filter on a suffix, use the .final extension to invoke the Lambda and process the records.
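For illustration, that suffix filter could be configured with boto3 roughly like this (the bucket name and function ARN are placeholders, S3 must already have permission to invoke the function, and the same setup can be done in the console or CloudFormation):

import boto3

s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='my-upload-bucket',  # placeholder
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:region:account-id:function:process-batch',  # placeholder
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {
                'Key': {
                    'FilterRules': [
                        {'Name': 'suffix', 'Value': '.final'},
                    ]
                }
            },
        }]
    },
)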
In an alternative approach, if you do not have access to the system putting objects into your S3 bucket, then with each PUT operation in the bucket, invoke the Lambda and insert an entry in DynamoDB.
You need to put a unique entry per file (not per file part) in Dynamo with -
filename and last_part_received_time
The last_part_received_time keeps getting updated as long as file parts keep arriving.
Now, this table can be looked up by a cron Lambda invocation which checks whether the time skew (the difference between the SYSTIME of the Lambda invocation and the DynamoDB entry's last_part_received_time) is large enough to start processing the records.
I would still prefer the first approach, as the second one still leaves room for error.
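A hedged sketch of that cron check (the table name, attribute names, and the five-minute skew threshold are assumptions):

import time
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('file-tracking')  # assumed tracking table
SKEW_SECONDS = 300                       # assumed "quiet period" before processing

def handler(event, context):
    now = int(time.time())
    for item in table.scan()['Items']:
        # if no new part has arrived for a while, assume the file is complete
        if now - int(item['last_part_received_time']) > SKEW_SECONDS:
            process_file(item['filename'])
            table.delete_item(Key={'filename': item['filename']})

def process_file(filename):
    ...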
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.
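As a hedged illustration of that error-handling idea (the queue URL and processing logic are placeholders), the Lambda consuming the SQS queue could look roughly like this:

import json
import boto3

sqs = boto3.client('sqs')
RETRY_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/retry-queue'  # placeholder

def handler(event, context):
    for record in event['Records']:        # SQS records wrapping the S3 events
        s3_event = json.loads(record['body'])
        try:
            process(s3_event)              # your per-upload concatenation logic
        except Exception:
            # park the failed event in a secondary queue with a delay,
            # so it is retried later instead of being lost
            sqs.send_message(
                QueueUrl=RETRY_QUEUE_URL,
                MessageBody=record['body'],
                DelaySeconds=300,
            )

def process(s3_event):
    ...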

Analyze binary NetCDF files with AWS Quicksight / Athena

I have a task to analyze weather forecast data in Quicksight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to Quicksight or even Athena?
There are python libraries that will decode the data from the binary files, such as Iris. They are used like this:
import iris
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)
So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:
Respond to an SQS message that a new binary file is available
Access the new binary file and decode it to access the forecast data
Add the decoded data to the set of already decoded data from previous SQS notifications
Make all the decoded data available in Athena / Quicksight
Tricky one, this...
What I would do is probably something like this:
Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 notifications (if you control the bucket), by SNS, SQS, or by a schedule in EventBridge. The function uses the code snippet included in your question to transform each new file and upload the transformed data to another S3 bucket.
I don't know the size of these files and how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small CSV will probably be easiest and will be good enough.
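A minimal sketch of such a conversion function, assuming iris and pandas are packaged with the Lambda, that the cubes are simple enough for iris.pandas to flatten, and that the destination bucket name is a placeholder:

import os
import boto3
import iris
import iris.pandas
import pandas as pd

s3 = boto3.client('s3')
DEST_BUCKET = 'converted-forecast-data'  # assumed destination bucket

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        local_path = '/tmp/' + os.path.basename(key)
        s3.download_file(bucket, key, local_path)

        # decode the NetCDF file and flatten each cube into tabular rows;
        # the exact columns depend on the structure of your forecast files
        cubes = iris.load(local_path)
        frames = [iris.pandas.as_data_frame(cube) for cube in cubes]

        out_path = local_path.replace('.nc', '.csv')
        pd.concat(frames).to_csv(out_path)
        s3.upload_file(out_path, DEST_BUCKET, os.path.basename(out_path))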
With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
If you end up with a lot of small files you might want to implement a second step where you once per day combine the converted files into bigger files, and possibly Parquet, but don't do anything like that unless you have to.
An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.
I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases Athena Federated Query was designed for, and it might be worth looking into. If NetCDF files are common and a data source for such files would be useful for other people, I'm sure the Athena team would love to talk to you and help you out.

Export DynamoDB table to S3 automatically

The scenario is the following: I have a lambda function that does an http request to get the data of today and the last 365 days and stores them in DynamoDB. The function is triggered every day at 8am, so the most recent data is always saved in the DynamoDB table.
Now my goal is to export the DynamoDB table to a S3 file automatically on an everyday basis as well, so I'm able to use services like QuickSight, Athena, Forecast on the data.
If possible and easily implementable, I'd like to have only one S3 file that gets appended with the most recent data each day, because an extra file every day seems kinda pricey. If that's not possible, an extra file every day would also be fine.
What's the best way to go about doing so without using CLI (because I'm not allowed to install programs to my laptop) and without using Lambda (because I wouldn't know how to write a function for that without any tutorials)?
Take a look at Data Pipeline. This is a supported use case, and most of the configuration is simple.
It will also not require any knowledge of Lambda and can be automated.
More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html
DynamoDB recently released a new, native feature to export your table's data to an S3 bucket. It supports exporting into DynamoDB JSON and Amazon Ion - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This will enable you to run whatever analytics tools you'd like (Athena, etc.) on the data exported in S3.
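For reference, the export can also be kicked off programmatically with a single boto3 call (the table ARN and bucket are placeholders, and point-in-time recovery must be enabled on the table); you could run this from a scheduled job or anywhere Python is available:

import boto3

dynamodb = boto3.client('dynamodb')

response = dynamodb.export_table_to_point_in_time(
    TableArn='arn:aws:dynamodb:eu-west-1:123456789012:table/daily-data',  # placeholder
    S3Bucket='my-export-bucket',                                          # placeholder
    ExportFormat='DYNAMODB_JSON',  # or 'ION'
)
print(response['ExportDescription']['ExportStatus'])  # typically 'IN_PROGRESS' at first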

AWS Lambda generates large size files to S3

Currently we have an AWS Lambda (Java-based runtime) which takes an SNS message as input, performs business logic, generates one XML file, and stores it to S3.
The current implementation creates the XML in the /tmp location, which is subject to Lambda's ephemeral storage limit (512 MB).
Is there a way to still use Lambda but stream the XML file to S3 without using the /tmp folder?
I have done some research but still have not found a solution.
Thank you.
You can upload an object to S3 directly from memory without having to store it locally, using the PutObject API. However, keep in mind that Lambda also has time and total memory limits; you may run out of those too if your object is too big.
If you can split the file into chunks, and don't need to update the beginning of the file while working on its end, you can use a multipart upload: provide a ready-to-go chunk, upload it, and then free the memory for the next chunk.
Otherwise you still need temporary storage to form all the parts of the XML. You can use DynamoDB or Redis, and once you have collected all the parts of the XML there, you can start uploading it part by part, then clean up the database (or set a TTL to automate the cleanup).
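To illustrate the multipart idea (in Python rather than the question's Java, with a placeholder bucket/key and a toy chunk generator standing in for the real XML logic), a minimal sketch could look like this:

import io
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-output-bucket'   # placeholder
KEY = 'output/large.xml'      # placeholder
MIN_PART = 5 * 1024 * 1024    # every part except the last must be at least 5 MB

def generate_xml_chunks():
    # stand-in for the business logic that builds the XML incrementally
    yield b'<root>'
    yield b'</root>'

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts = []
part_number = 1
buffer = io.BytesIO()

for chunk in generate_xml_chunks():
    buffer.write(chunk)
    if buffer.tell() >= MIN_PART:
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=mpu['UploadId'],
                              PartNumber=part_number, Body=buffer.getvalue())
        parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
        part_number += 1
        buffer = io.BytesIO()  # free the memory used by the uploaded part

# upload whatever is left as the final part (it may be smaller than 5 MB)
if buffer.tell():
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=mpu['UploadId'],
                          PartNumber=part_number, Body=buffer.getvalue())
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})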