Accessing large files stored in AWS S3 using AWS Lambda functions

I have a file of more than 30 GB stored in S3, and I want to write a Lambda function that will access that file, parse it, and then run some algorithm on it.
I am not sure whether my Lambda function can handle a file that big, since the maximum execution time for a Lambda function is 300 seconds (5 minutes).
I found the S3 Transfer Acceleration feature, but will it help?
Given this scenario, can anyone suggest a service other than Lambda where I could host my code as a microservice and parse the file?
Thanks in advance.

It depends entirely on your processing requirements and how frequently the file needs to be processed.
You can use Amazon EMR to parse the file and run your algorithm; depending on your requirements, you can terminate the cluster afterwards or keep it alive for frequent processing. https://aws.amazon.com/emr/getting-started/
You can try the recently launched Amazon Athena service, which will help you query and process files stored in S3; the infrastructure is managed by Amazon. http://docs.aws.amazon.com/athena/latest/ug/getting-started.html
For complex processing flows, you can combine AWS services such as AWS Data Pipeline (to manage the flow) with EMR or EC2 (to run the processing tasks). https://aws.amazon.com/datapipeline/
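If you go the Athena route, a minimal sketch of starting a query from Python with boto3 might look like this (the query, database name, and output bucket are placeholders):

```python
import boto3

athena = boto3.client("athena")

def start_query():
    # Start an Athena query over data already sitting in S3; Athena writes
    # the results back to the S3 output location you specify.
    response = athena.start_query_execution(
        QueryString="SELECT col1, COUNT(*) FROM my_table GROUP BY col1",  # placeholder query
        QueryExecutionContext={"Database": "my_database"},                # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
    )
    return response["QueryExecutionId"]
```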
Hope this helps.

Related

Trigger a Custom Function Every X Hours in AWS

I am looking to trigger code every hour in AWS.
The code should parse through a list of zip codes, fetch data for each zip code, and store that data somewhere in AWS.
Is there a specific AWS service I would use to parse through the list of zip codes and call the API for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking at different approaches and services in AWS. I found that I could write serverless code in Lambda, which made me think it would be the answer to my first question. I then tried to find out how that code could be run every X hours, but I was struggling to tell whether I could still use Lambda for that, and I wasn't sure what my options were for storing the data. I saw that Glue might be an option, but I wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every hour.
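A minimal sketch of creating such a schedule with boto3 (both ARNs are placeholders, and the role must allow EventBridge Scheduler to invoke the function):

```python
import boto3

scheduler = boto3.client("scheduler")

# Create a schedule that invokes the Lambda once an hour. Both ARNs are
# placeholders; the role must grant scheduler lambda:InvokeFunction.
scheduler.create_schedule(
    Name="hourly-zip-code-fetch",
    ScheduleExpression="rate(1 hour)",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:fetch-zip-data",
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-invoke-role",
    },
)
```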
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, NoSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still do SQL-like queries on the data using Amazon Athena.
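For the Lambda itself, a rough sketch of an hourly function that fetches data per zip code and writes JSON objects to S3 (the API URL and bucket name are hypothetical):

```python
import json
import urllib.request

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Fetch data for each zip code and store the raw JSON in S3.
    # The API URL and bucket name are placeholders for your data source.
    zip_codes = ["10001", "94105", "60601"]
    for zip_code in zip_codes:
        url = f"https://api.example.com/data?zip={zip_code}"
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        s3.put_object(
            Bucket="my-zip-data-bucket",
            Key=f"zip={zip_code}/data.json",
            Body=json.dumps(data).encode("utf-8"),
        )
```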

Large Data transfer using AWS Lambda and AWS S3

We have to create a large zip output stream (500 MB to 1 GB max file size) using AWS Lambda (Java) and transfer it to S3. I am facing an issue here:
a. Connection timeout, since the file is large.
If the zip file is small, it works fine.
From what I have checked, it seems like multipart upload might help. I have yet to try it, but I wanted to know if there is a better option.
Does the file transfer from AWS Lambda to S3 happen over the Internet? Is there any way I can use AWS's optimized internal network, since I have no need to move the data outside the AWS network (the zip data is created within Lambda)? Would that make the data transfer faster?
The other option is to use multipart upload in Java (since we are a Java shop).
I understand that AWS Lambda has an execution timeout of 15 minutes, so if the data transfer takes a long time, we might hit the Lambda execution timeout instead of a connection timeout. That is also not acceptable, so a fast data transfer would really help. The processing is otherwise trivial; I would have to keep the Lambda running only because the data transfer takes a long time.
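For reference, here is the kind of managed multipart upload I am considering, sketched in Python with boto3 for brevity (bucket and key names are placeholders; the AWS SDK for Java's TransferManager offers the equivalent):

```python
import io
import zipfile

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Build the zip in memory for this sketch; a real implementation would
    # stream the archive instead of buffering 500 MB+ in RAM.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data.txt", "example payload")  # placeholder content
    buf.seek(0)

    # upload_fileobj switches to multipart upload above the threshold and
    # sends several parts concurrently, avoiding one long-lived connection
    # that can time out on a single large PUT.
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,
        multipart_chunksize=64 * 1024 * 1024,
        max_concurrency=4,
    )
    s3.upload_fileobj(buf, "my-output-bucket", "large-archive.zip", Config=config)
```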
Thanks in advance.

AWS Lambda vs AWS Batch

I am currently working on a project where I need to merge two significantly large CSV files into one (both are a few hundred MBs). I am fairly new to AWS. I am aware of the memory allocation and execution time limitations of Lambda. Other than those, are there any advantages to using Batch jobs over Lambda for this project? Is there any other AWS component that is more suitable for this task? Either the Lambda function or the Batch job will be triggered inside a Step Function by an SNS notification.
Lambda functions have some limitations:
Execution time: 15 minutes
RAM: 3 GB
Disk space: /tmp is only 512 MB, so it is difficult to store any file larger than that on Lambda
The good points are that Lambda is cheap and boots up fast.
I suggest you use ECS (both the Fargate and EC2 launch types are good).
Try using a Python function in Lambda that writes to S3 with boto3.
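For example, a rough sketch of such a function that streams both CSVs through /tmp and uploads the merged result (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Stream both CSVs from S3 and concatenate them into /tmp, then upload
    # the result. Note that /tmp is small (512 MB by default), so if the
    # merged file doesn't fit, Batch/ECS or a streaming multipart upload
    # is the better tool.
    bucket = "my-input-bucket"
    with open("/tmp/merged.csv", "wb") as out:
        for key in ["first.csv", "second.csv"]:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"]
            # Naive concatenation: this keeps both header rows; real code
            # would skip the header line of the second file.
            for chunk in iter(lambda: body.read(1024 * 1024), b""):
                out.write(chunk)
    s3.upload_file("/tmp/merged.csv", bucket, "merged/merged.csv")
```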

Ideas and guidelines on an end-to-end AWS solution

I want to build an end-to-end automated system that consists of the following steps:
Getting data from the source into a landing bucket in AWS S3 using AWS Lambda
Running a transformation job using AWS Lambda and storing the result in a processed bucket in AWS S3
Running a Redshift COPY command using AWS Lambda to push the transformed/processed data from AWS S3 to AWS Redshift
Of the above steps, I've completed pulling the data, transforming it, and running a manual COPY command in Redshift using a SQL query tool.
Doubts:
I've heard AWS CloudWatch can be used to schedule/automate things, but I've never worked with it. If I want to achieve the steps above in a streamlined fashion, how should I go about it?
Should I use Lambda to trigger the COPY and INSERT statements? Or are there better AWS services to do the same?
Any other suggestions on AWS services and the like are most welcome.
Constraint: I want as many tasks as possible to be serverless (except for the semantic layer, Redshift).
CloudWatch:
Your options here are to use either CloudWatch Alarms or CloudWatch Events.
With alarms, you can respond to any metric of your system (e.g. CPU utilization, disk IOPS, count of Lambda invocations) when it crosses some threshold, and when the alarm is triggered, invoke a Lambda function (or send an SNS notification, etc.) to perform a task.
With events, you can use either a cron expression or an AWS service event (e.g. an EC2 instance state change, an SNS notification) to trigger another service such as Lambda. For example, you could run some kind of clean-up operation via Lambda on a regular schedule, or create a snapshot of an EBS volume when its instance is shut down.
Lambda itself is a very powerful tool and should allow you to program a decent copy/insert function in a language you are familiar with. AWS has several GitHub repos with lots of examples too; see, for example, their serverless examples and sample repositories. There may be other services that could work in your specific case, but part of Lambda's power is its flexibility.
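As one illustration, a minimal sketch of a Lambda that issues the Redshift COPY through the Redshift Data API (cluster, database, user, table, and IAM role are all placeholders; a direct database connection with a SQL driver would work just as well):

```python
import boto3

# The Redshift Data API lets a Lambda run SQL without holding a persistent
# database connection. All identifiers below are placeholders.
redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    response = redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="analytics",
        DbUser="etl_user",
        Sql=(
            "COPY sales FROM 's3://my-processed-bucket/sales/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
            "FORMAT AS CSV"
        ),
    )
    # The call is asynchronous; poll describe_statement with this Id if you
    # need to wait for the COPY to finish.
    return response["Id"]
```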

AWS Lambda - Work with services outside Amazon?

After reading about AWS Lambda, I've taken quite an interest in it, although there is one thing I couldn't really find any info on. My question is: is it possible to have Lambda work with services outside Amazon? Say I have a database from some other provider; would it be possible to perform operations on it through AWS Lambda?
Yes.
AWS Lambda functions run code just like any other application. So you can write a Lambda function that calls any service on the Internet, just like any computer can. Of course, you'd need network access to the external service and the appropriate credentials.
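For example, a minimal sketch of a Lambda that queries an external MySQL database (host, credentials, and table are placeholders, and the pymysql package would need to be bundled with the deployment package):

```python
import pymysql  # pure-Python MySQL client; bundle it in the deployment package

def lambda_handler(event, context):
    # Connect to a database hosted outside AWS. Host, credentials, and the
    # query are all placeholders; the function only needs network access
    # to the endpoint (and outbound internet access if it runs in a VPC).
    conn = pymysql.connect(
        host="db.example.com",
        user="app_user",
        password="app_password",
        database="mydb",
        connect_timeout=5,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM users")
            (count,) = cur.fetchone()
        return {"user_count": count}
    finally:
        conn.close()
```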
There are some limitations to Lambda functions, such as functions running for a maximum of five minutes and having only 512 MB of local disk space.
So when should you use Lambda? Typically, it's when you wish some code to execute in response to some event, such as a file being uploaded to Amazon S3, data being received via Amazon Kinesis, or a skill being activated on an Alexa device. If this fits your use-case, go for it!