AWS Serverless CSV to Queue to CSV Architecture

I'm currently playing around with AWS for some serverless CSV processing. I have decent familiarity with EC2 and DynamoDB. I'm sure there is a better way to structure this, and I haven't found an efficient way to store the data. Any architecture suggestions would be much appreciated.
This flow will take in a CSV uploaded to S3, process all the rows of tuples and output a new CSV of processed data to S3.
What are 1) the optimal architecture and 2) the optimal place to store the processed data until the queue is drained and the output CSV can be built?
Data flow and service architecture:
CSV (contains tuples) (S3) -> CSV processing (Lambda) -> Queue (SNS) -> Queue Processing (Lambda) -> ????? temporary storage for queue items that have been processed before they get written to CSV ???? (what to use here?) -> CSV building (Lambda) -> CSV storage (S3)
Clever ideas appreciated.

I believe you are overcomplicating matters.
S3 can invoke a Lambda function when events occur. This is set up directly in the S3 bucket's event notifications.
So use this to make a converted version of the CSV in another bucket.
Amazon has an example of how to do this sort of thing here:
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
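As a rough illustration of that pattern, here is a minimal sketch of such an S3-triggered Lambda in TypeScript (AWS SDK v3). The output bucket name and the row transformation are placeholders; the real conversion logic would replace the upper-casing step.

```typescript
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import type { S3Event } from "aws-lambda";

const s3 = new S3Client({});
// Hypothetical destination bucket; the source bucket comes from the event.
const OUTPUT_BUCKET = process.env.OUTPUT_BUCKET ?? "my-processed-csv-bucket";

export const handler = async (event: S3Event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    // Download the uploaded CSV.
    const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    const csv = await obj.Body!.transformToString();

    // Placeholder row transformation: here we just upper-case every field.
    const converted = csv
      .split("\n")
      .map((line) => line.split(",").map((field) => field.trim().toUpperCase()).join(","))
      .join("\n");

    // Write the converted CSV to the other bucket under the same key.
    await s3.send(
      new PutObjectCommand({ Bucket: OUTPUT_BUCKET, Key: key, Body: converted })
    );
  }
};
```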

Update (in reply to the comment "it doesn't parallelize anything"):
You can divide the task equally if you have a good idea of how many tuples can be processed by a single Lambda within its time limit.
For example, given the following info...
original CSV contains 50,000 tuples
a single Lambda can process 5000 tuples within the time limit.
You can then do 10 parallel asynchronous invocations of the processor Lambda with each of them working with a different offset.
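A sketch of that fan-out, assuming the numbers above and a hypothetical processor function name supplied via an environment variable:

```typescript
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});
const PROCESSOR_FUNCTION = process.env.PROCESSOR_FUNCTION ?? "csv-processor"; // hypothetical name
const TOTAL_ROWS = 50_000; // known size of the CSV
const CHUNK_SIZE = 5_000;  // rows one Lambda can finish within its time limit

// Fan out: one asynchronous invocation per 5,000-row chunk (10 in total).
export const fanOut = async (s3Path: string) => {
  const invocations = [];
  for (let offset = 0; offset < TOTAL_ROWS; offset += CHUNK_SIZE) {
    invocations.push(
      lambda.send(
        new InvokeCommand({
          FunctionName: PROCESSOR_FUNCTION,
          InvocationType: "Event", // asynchronous, so all chunks run in parallel
          Payload: Buffer.from(JSON.stringify({ s3Path, offset, limit: CHUNK_SIZE })),
        })
      )
    );
  }
  await Promise.all(invocations);
};
```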
Original answer:
You can make it work with two Lambdas:
Listener
S3-triggered Lambda whose only job is to pass the S3 path of the newly uploaded CSV to the Processor Lambda.
Processor
a Lambda that is triggered by the Listener. It requires the S3 path and an offset as parameters (where the offset is the row of the CSV at which it should start processing).
This Lambda performs the actual processing of your CSV rows. It should keep track of which row it's currently processing and just before the Lambda time limit is reached, it will stop and invoke itself with the same s3 path but with a new offset.
So, basically, it's a recursive Lambda that invokes itself until all CSV rows are processed.
To check the remaining time, you can use the context.getRemainingTimeInMillis() method (Node.js) inside a while or for loop in your handler.
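As a minimal sketch of such a self-invoking processor (TypeScript on the Node.js runtime, AWS SDK v3): loadRows and processRow are assumed helper stubs, and the 10-second safety margin is an arbitrary buffer, not a recommended value.

```typescript
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";
import type { Context } from "aws-lambda";

const lambda = new LambdaClient({});

// Hypothetical event shape: the Listener passes { s3Path, offset: 0 } on the first call.
interface ProcessorEvent {
  s3Path: string;
  offset: number;
}

export const handler = async (event: ProcessorEvent, context: Context) => {
  const rows = await loadRows(event.s3Path); // assumed helper that reads the CSV from S3
  let row = event.offset;

  // Keep a safety margin so we can re-invoke before the hard time limit hits.
  while (row < rows.length && context.getRemainingTimeInMillis() > 10_000) {
    await processRow(rows[row]); // assumed helper doing the per-row work
    row++;
  }

  if (row < rows.length) {
    // Not done yet: recurse with the same S3 path but a new offset.
    await lambda.send(
      new InvokeCommand({
        FunctionName: context.functionName, // invoke ourselves
        InvocationType: "Event",
        Payload: Buffer.from(JSON.stringify({ s3Path: event.s3Path, offset: row })),
      })
    );
  }
};

// Stubs so the sketch compiles; real implementations read from S3 and do the work.
async function loadRows(_s3Path: string): Promise<string[]> { return []; }
async function processRow(_row: string): Promise<void> {}
```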

Related

AWS Lambda/DynamoDB

My Lambda invokes another Lambda: the first Lambda inserts data from S3 into DynamoDB and then invokes the second Lambda. The second Lambda reads the data from DynamoDB and creates an Excel file in S3.
When inserting hundreds of records it worked fine, but when inserting 1,500+ records the first Lambda inserts the data into DynamoDB correctly and invokes the second Lambda, yet the second Lambda creates two files: one with the correct, expected number of records and a duplicate with fewer records than expected.
I tried increasing the timeout for both Lambdas, but that did not work.
While I am not sure why your code generates duplicate Excel files, I want to offer some insight from an application/architecture perspective:
Why do you need to insert the data into DynamoDB first and then use another process to read it back from DynamoDB to generate the Excel file? It seems to me the whole process can be done in a single Lambda: the function reads the data from the source S3 bucket into the Lambda's local tmp directory (or memory), inserts the data into DynamoDB, and then uses the local copy of the data to generate the Excel file in the target S3 bucket.
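Roughly, that consolidated flow could look like the sketch below (TypeScript, AWS SDK v3). The table name, bucket name, and "id" partition key are placeholders, the CSV handling is deliberately naive, and the report is written back as CSV for brevity; an Excel library such as exceljs could produce .xlsx instead.

```typescript
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import type { S3Event } from "aws-lambda";

const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = process.env.TABLE_NAME ?? "records";            // hypothetical table
const TARGET_BUCKET = process.env.TARGET_BUCKET ?? "reports"; // hypothetical bucket

export const handler = async (event: S3Event) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));

  // 1. Read the source file once into memory.
  const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const rows = (await obj.Body!.transformToString()).trim().split("\n");

  // 2. Insert into DynamoDB (simple per-item puts; batching is a further optimization).
  for (const [i, line] of rows.entries()) {
    await ddb.send(new PutCommand({ TableName: TABLE, Item: { id: `${key}#${i}`, line } }));
  }

  // 3. Build the report from the same in-memory data; no read-back from DynamoDB needed.
  //    (Written as CSV here; an Excel library could produce .xlsx instead.)
  await s3.send(
    new PutObjectCommand({ Bucket: TARGET_BUCKET, Key: `${key}.report.csv`, Body: rows.join("\n") })
  );
};
```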

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda, and I am not certain how to trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since the files are uploaded to S3 simultaneously and depend on each other, I would like the Lambda to be triggered only once all files have been uploaded.
Current Approaches/Attempts:
- I am considering an S3 trigger, but my concern is that the Lambda would start (and fail) when only a single file has been uploaded. An alternative would be adding a wait time, but that is not preferred since it wastes compute resources.
- A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.
- An SNS trigger, but I am not certain the message can be automated without knowing when the file uploads are complete.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you can either identify the "last part", e.g. based on some metadata, or you will need to store and track the state of all uploads, e.g. in DynamoDB, and do the actual processing only when a batch is complete.
Best, Stefan
Your files coming in parts might be named like:
filename_part1.ext
filename_part2.ext
If one of your systems is generating those files, then have that system also generate a final dummy blank file named:
filename.final
Since an S3 event trigger can filter on a key suffix, use the .final extension to invoke the Lambda and process the records. For example:
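For illustration, a sketch of wiring up that suffix-filtered notification with the AWS SDK v3; the bucket name and function ARN are placeholders, and the Lambda also needs a resource-based permission allowing S3 to invoke it:

```typescript
import {
  S3Client,
  PutBucketNotificationConfigurationCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Placeholder bucket name and function ARN.
export const configureFinalFileTrigger = async () => {
  await s3.send(
    new PutBucketNotificationConfigurationCommand({
      Bucket: "my-upload-bucket",
      NotificationConfiguration: {
        LambdaFunctionConfigurations: [
          {
            LambdaFunctionArn:
              "arn:aws:lambda:ap-south-1:123456789012:function:process-records",
            Events: ["s3:ObjectCreated:*"],
            // Only keys ending in ".final" will invoke the function.
            Filter: {
              Key: { FilterRules: [{ Name: "suffix", Value: ".final" }] },
            },
          },
        ],
      },
    })
  );
};
```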
In an alternative approach, if you do not have access to the server putting objects into your S3 bucket, then on each PUT operation in your S3 bucket, invoke the Lambda and insert an entry in DynamoDB.
You need to put a unique entry per file (not per file part) in DynamoDB with:
filename and last_part_received_time
The last_part_received_time keeps getting updated as long as file parts keep arriving.
Now, this table can be checked by a scheduled (cron) Lambda invocation that looks at the time skew (the difference between the system time of the Lambda invocation and the DynamoDB entry's last_part_received_time) and decides whether enough time has passed to process the records (see the sketch after this answer).
I would still prefer the first approach, as the second one still leaves room for error.
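For what it's worth, a rough sketch of the second approach (TypeScript, AWS SDK v3); the table name, the _partN file-naming convention, and the five-minute quiet threshold are all assumptions:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import {
  DynamoDBDocumentClient,
  UpdateCommand,
  ScanCommand,
} from "@aws-sdk/lib-dynamodb";
import type { S3Event } from "aws-lambda";

// Hypothetical table name; partition key is the logical file name.
const TABLE = process.env.TRACKING_TABLE ?? "file-upload-tracker";
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fired on every PUT: upsert the file's entry and refresh the timestamp.
export const onPart = async (event: S3Event) => {
  for (const record of event.Records) {
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    const filename = key.replace(/_part\d+(\.\w+)?$/, ""); // strip "_partN.ext"
    await ddb.send(
      new UpdateCommand({
        TableName: TABLE,
        Key: { filename },
        UpdateExpression: "SET last_part_received_time = :now",
        ExpressionAttributeValues: { ":now": Date.now() },
      })
    );
  }
};

// Scheduled (cron) Lambda: process files that have been quiet long enough.
export const onSchedule = async () => {
  const quietMs = 5 * 60 * 1000; // assumed threshold of 5 minutes without new parts
  const { Items = [] } = await ddb.send(new ScanCommand({ TableName: TABLE }));
  for (const item of Items) {
    if (Date.now() - Number(item.last_part_received_time) >= quietMs) {
      // processFile(item.filename) would do the real work; left out of this sketch.
    }
  }
};
```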
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.
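A hedged sketch of that SQS-consuming Lambda; the retry queue URL is a placeholder, and a message delay is used here as a simple stand-in for the visibility-timeout style of backoff described above:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import type { SQSEvent } from "aws-lambda";

const sqs = new SQSClient({});
const RETRY_QUEUE_URL = process.env.RETRY_QUEUE_URL!; // placeholder secondary queue

// Lambda consuming S3 PUT notifications delivered through SQS.
export const handler = async (event: SQSEvent) => {
  for (const message of event.Records) {
    try {
      await processRecords(JSON.parse(message.body));
    } catch {
      // On failure, push the event to a secondary queue with a delay,
      // which acts as a crude backoff before the next attempt.
      await sqs.send(
        new SendMessageCommand({
          QueueUrl: RETRY_QUEUE_URL,
          MessageBody: message.body,
          DelaySeconds: 300, // assumed 5-minute backoff
        })
      );
    }
  }
};

// Stub for the real per-event work.
async function processRecords(_s3Event: unknown): Promise<void> {}
```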

Running multiple apache spark streaming jobs

I'm new to Spark Streaming, and as far as I can see there are different ways of doing the same thing, which makes me a bit confused.
This is the scenario:
We have multiple events (over 50 different types) happening every minute, and I want to do some data transformation, change the format from JSON to Parquet, and store the data in an S3 bucket. I'm creating a pipeline where we get the data, store it in an S3 bucket, and then the transformation happens (Spark jobs). My questions are:
1- Is it good if I run a Lambda function that sorts each event type into a separate subdirectory and then read those folders in Spark Streaming, or is it better to store all the events in the same directory and then read them in my Spark Streaming job?
2- How can I run multiple Spark Streaming jobs at the same time? (I tried looping through a list of schemas and folders, but apparently that doesn't work.)
3- Do I need an orchestration tool (Airflow) for my purpose? I need to look for new events all the time with no pause in between.
I'm going to use: Kinesis Firehose -> S3 (data lake) -> EMR (Spark) -> S3 (data warehouse).
Thank you so much beforehand!

"Realtime" syncing of large numbers of log files to S3

I have a large number of logfiles from a service that I need to regularly run analysis on via EMR/Hive. There are thousands of new files per day, and they can technically come out of order relative to the file name (e.g. a batch of files comes a week after the date in the file name).
I did an initial load of the files via Snowball, then set up a script that syncs the entire directory tree once per day using the 'aws s3 sync' CLI command. This is good enough for now, but I will need a more realtime solution in the near future. The issue with this approach is that it takes a very long time, on the order of 30 minutes per day, and uses a ton of bandwidth all at once! I assume this is because it needs to scan the entire directory tree to determine which files are new, then send them all at once.
A realtime solution would be beneficial in 2 ways. One, I can get the analysis I need without waiting up to a day. Two, the network use would be lower and more spread out, instead of spiking once a day.
It's clear that 'aws s3 sync' isn't the right tool here. Has anyone dealt with a similar situation?
One potential solution could be:
Set up a service on the log-file side that continuously syncs (or aws s3 cp) new files based on the modified date. But wouldn't that need to scan the whole directory tree on the log server as well?
For reference, the log-file directory structure is like:
/var/log/files/done/{year}/{month}/{day}/{source}-{hour}.txt
There is also a /var/log/files/processing/ directory for files being written to.
Any advice would be appreciated. Thanks!
You could have a Lambda function triggered automatically as a new object is saved on your S3 bucket. Check Using AWS Lambda with Amazon S3 for details. The event passed to the Lambda function will contain the file name, allowing you to target only the new files in the syncing process.
If you'd like to wait until you have, say, 1,000 files in order to sync in batches, you could use AWS SQS and the following workflow (2 Lambda functions, 1 CloudWatch rule, and 1 SQS queue):
S3 invokes Lambda whenever there's a new file to sync
Lambda stores the filename in SQS
CloudWatch triggers another Lambda function every X minutes/hours to check how many files are in SQS waiting to be synced. Once there are 1,000 or more, it retrieves those filenames and runs the syncing process (see the sketch below).
Keep in mind that Lambda has a hard timeout of 5 minutes. If your sync job takes too long, you'll need to break it into smaller chunks.
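A sketch of that scheduled check-and-sync Lambda from the last step (TypeScript, AWS SDK v3); the queue URL, the 1,000-file threshold, and the runSync step are assumptions:

```typescript
import {
  SQSClient,
  GetQueueAttributesCommand,
  ReceiveMessageCommand,
  DeleteMessageBatchCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.SYNC_QUEUE_URL!; // hypothetical env var
const BATCH_THRESHOLD = 1000;

// Scheduled Lambda: check the backlog and sync only once enough files are queued.
export const handler = async () => {
  const attrs = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: QUEUE_URL,
      AttributeNames: ["ApproximateNumberOfMessages"],
    })
  );
  const backlog = Number(attrs.Attributes?.ApproximateNumberOfMessages ?? "0");
  if (backlog < BATCH_THRESHOLD) return;

  // Drain the filenames, 10 messages at a time (the SQS maximum per call).
  const filenames: string[] = [];
  while (filenames.length < BATCH_THRESHOLD) {
    const { Messages } = await sqs.send(
      new ReceiveMessageCommand({ QueueUrl: QUEUE_URL, MaxNumberOfMessages: 10 })
    );
    if (!Messages || Messages.length === 0) break;
    filenames.push(...Messages.map((m) => m.Body!));
    await sqs.send(
      new DeleteMessageBatchCommand({
        QueueUrl: QUEUE_URL,
        Entries: Messages.map((m) => ({
          Id: m.MessageId!,
          ReceiptHandle: m.ReceiptHandle!,
        })),
      })
    );
  }
  // runSync(filenames) would copy just these files to S3; omitted in this sketch.
};
```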
You could set the bucket up to log HTTP requests to a separate bucket, then parse the logs to look for newly created files and their paths. One trouble spot: as well as PUT requests, you have to look for the multipart upload operations, which are a sequence of POSTs. It's best to log for a few days and see what gets created before putting any effort into this approach.

How can we efficiently push data from csv file to dynamodb without using aws pipeline?

Considering that AWS Data Pipeline is not available in the Singapore region, are there any alternatives for efficiently pushing CSV data to DynamoDB?
If it were me, I would set up an S3 event notification on a bucket that fires a Lambda function each time a CSV file is dropped into it.
The notification lets Lambda know that a new file is available, and the Lambda function is responsible for loading the data into DynamoDB.
This works better (because of Lambda's limits) if the CSV files are not huge, so they can be processed in a reasonable amount of time. The bonus is that, once it is working, the only work left is to simply drop new files into the right bucket - no server required.
Here is a GitHub repository with a CSV-to-DynamoDB loader written in Java - it might help get you started.
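For a sense of what such a loader Lambda could look like in TypeScript with the AWS SDK v3, here is a hedged sketch. It assumes the table name comes from an environment variable, that the CSV's header row matches the table's attribute names (including its partition key), and that fields contain no embedded commas:

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchWriteCommand } from "@aws-sdk/lib-dynamodb";
import type { S3Event } from "aws-lambda";

const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = process.env.TABLE_NAME ?? "csv-import"; // hypothetical table name

export const handler = async (event: S3Event) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));

  const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const [header, ...lines] = (await obj.Body!.transformToString()).trim().split("\n");
  const columns = header.split(",");

  // Turn each CSV line into a plain item keyed by its column names.
  const items = lines.map((line) => {
    const values = line.split(",");
    return Object.fromEntries(columns.map((col, i) => [col.trim(), values[i]?.trim() ?? ""]));
  });

  // DynamoDB accepts at most 25 items per BatchWriteItem call.
  for (let i = 0; i < items.length; i += 25) {
    let requests = items.slice(i, i + 25).map((item) => ({ PutRequest: { Item: item } }));
    // Retry anything DynamoDB reports back as unprocessed (e.g. due to throttling).
    while (requests.length > 0) {
      const res = await ddb.send(
        new BatchWriteCommand({ RequestItems: { [TABLE]: requests } })
      );
      requests = (res.UnprocessedItems?.[TABLE] ?? []) as typeof requests;
    }
  }
};
```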