Large Data transfer using AWS Lambda and AWS S3 - amazon-web-services

We have to create a large zip output stream (500 MB to 1 GB max file size) using AWS Lambda (Java) and transfer it to S3. I am facing an issue here: the connection times out because the file is large.
If the zip file is small, it works fine.
When I checked, it seemed like multipart upload might help. I have yet to try it, but I wanted to know if there is a better option.
Does file transfer from AWS Lambda to S3 happen over the Internet? Is there any way I can use the AWS-optimized network, since I have no need to use data from outside the AWS network (the zip data is created within the Lambda only)? Will that help with faster data transfer?
The other option is to use multipart upload in Java (since we are a Java shop).
I understand that AWS Lambda has an execution timeout of 15 minutes, so if the data transfer takes a long time, we might hit the Lambda execution timeout instead of a connection timeout. That is also not acceptable, so fast data transfer would really help. The processing is otherwise trivial; I have to keep the Lambda running only because the data transfer takes a long time.
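For context, the kind of multipart upload I have in mind is roughly the following. This is only a sketch, assuming the AWS SDK for Java v2 and Java 11+; the bucket, key, and the input stream feeding the zip bytes are placeholders:

```java
// Sketch: upload a large zip stream to S3 in 8 MB parts (AWS SDK for Java v2).
// A real implementation should call abortMultipartUpload on failure.
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class MultipartZipUploader {
    private static final int PART_SIZE = 8 * 1024 * 1024; // parts must be >= 5 MB (except the last)

    public static void upload(S3Client s3, String bucket, String key, InputStream zipStream) throws IOException {
        String uploadId = s3.createMultipartUpload(
                CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId();

        List<CompletedPart> parts = new ArrayList<>();
        int partNumber = 1;
        byte[] chunk;
        while ((chunk = zipStream.readNBytes(PART_SIZE)).length > 0) { // readNBytes needs Java 11+
            UploadPartResponse resp = s3.uploadPart(
                    UploadPartRequest.builder()
                            .bucket(bucket).key(key)
                            .uploadId(uploadId).partNumber(partNumber)
                            .build(),
                    RequestBody.fromBytes(chunk));
            parts.add(CompletedPart.builder().partNumber(partNumber).eTag(resp.eTag()).build());
            partNumber++;
        }

        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                .bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
                .build());
    }
}
```

The idea is that only one part is held in memory at a time, and each part is a separate request, so no single long-lived connection has to survive the whole transfer.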
Thanks in advance.

Related

Trigger a Custom Function Every X Hours in AWS

I am looking to trigger code every hour in AWS.
The code should: parse through a list of zip codes, fetch data for each of the zip codes, and store that data somewhere in AWS.
Is there a specific AWS service I would use to parse through the list of zip codes and call the API for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking at different approaches and services in AWS. I found I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every X hours, but that's where I was struggling to know whether I could still use Lambda. I also wasn't sure where my options were for storing the data; I saw that Glue may be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every hour (a sketch of the Lambda side follows the storage options below).
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, NoSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still do SQL-like queries on the data using Amazon Athena.
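For illustration, here is a minimal sketch of the Lambda side in Java (aws-lambda-java-core plus the AWS SDK for Java v2, with S3 as the store). The zip-code list, the fetchDataFor() stub, and the bucket name are placeholders for your real inputs; EventBridge Scheduler would invoke this handler on a rate(1 hour) schedule:

```java
// Sketch of an hourly job: loop over zip codes, fetch data, write each result to S3.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.time.Instant;
import java.util.List;

public class HourlyZipCodeJob implements RequestHandler<Object, String> {
    private final S3Client s3 = S3Client.create();

    @Override
    public String handleRequest(Object event, Context context) {
        List<String> zipCodes = List.of("10001", "94105", "60601"); // placeholder list
        for (String zip : zipCodes) {
            String data = fetchDataFor(zip); // call your external API here
            String key = "zip-data/" + zip + "/" + Instant.now() + ".json";
            s3.putObject(PutObjectRequest.builder()
                            .bucket("my-results-bucket") // placeholder bucket
                            .key(key)
                            .build(),
                    RequestBody.fromString(data));
        }
        return "done";
    }

    private String fetchDataFor(String zip) {
        return "{\"zip\":\"" + zip + "\"}"; // stub; replace with the real API call
    }
}
```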

What Amazon service should I use in order to serve merged files from an S3 bucket?

I need an HTTP web service serving files (1-10 GiB) that are the result of merging some smaller files in an S3 bucket. The logic is pretty easy to implement, but I need very high scalability, so I would prefer to put it in the cloud. Which Amazon service will be most feasible for this particular case? Should I use AWS Lambda for that?
Unfortunately, you can't achieve that with Lambda, since it only offers 512 MB of storage and you can't mount volumes. You will need EBS or EFS to download and process the data. Since you need scalability, I would suggest Fargate + EFS. Plain EC2 instances would do just fine, but you might lose some money because it can be tricky to provision the correct amount for your needs, and most of the time it ends up overprovisioned.
If you don't need to process the files in real time, you can use a single instance and use SQS to queue the jobs and save some money. In that scenario you could use Lambda to trigger the jobs, and even start/stop the instance when it is not in use.
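A rough sketch of that SQS-queued variant (AWS SDK for Java v2); the queue URL and the message format are assumptions, with a Lambda as the producer and the worker loop running on the single EC2 instance or Fargate task:

```java
// Sketch: producer enqueues merge jobs, worker long-polls and processes them one at a time.
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.*;

public class MergeJobQueue {
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/merge-jobs"; // placeholder

    // Producer side: a Lambda (or anything else) enqueues the keys to merge.
    public static void enqueue(SqsClient sqs, String jobJson) {
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(QUEUE_URL)
                .messageBody(jobJson) // e.g. {"keys":["part1","part2"],"output":"merged.bin"}
                .build());
    }

    // Worker side: runs on the instance, merges on EFS/EBS, then deletes the message.
    public static void workerLoop(SqsClient sqs) {
        while (true) {
            ReceiveMessageResponse resp = sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(QUEUE_URL)
                    .maxNumberOfMessages(1)
                    .waitTimeSeconds(20) // long polling
                    .build());
            for (Message m : resp.messages()) {
                // mergeFiles(m.body()); // download, merge, upload the result
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(QUEUE_URL)
                        .receiptHandle(m.receiptHandle())
                        .build());
            }
        }
    }
}
```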
Merging files
It is possible to concatenate Amazon S3 files by using UploadPartCopy:
Uploads a part by copying data from an existing object as data source.
However, the minimum allowable part size for a multipart upload is 5 MB.
Thus, if each of your parts is at least 5 MB, then this would be a way to concatenate files without downloading and re-uploading.
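For illustration, a minimal sketch of that concatenation using the AWS SDK for Java v2 (the bucket and key names are placeholders; every source object except the last must be at least 5 MB):

```java
// Sketch: server-side concatenation of S3 objects via a multipart upload with UploadPartCopy.
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

import java.util.ArrayList;
import java.util.List;

public class S3Concatenator {
    public static void concatenate(S3Client s3, String bucket, String destKey, List<String> sourceKeys) {
        String uploadId = s3.createMultipartUpload(
                CreateMultipartUploadRequest.builder().bucket(bucket).key(destKey).build()).uploadId();

        List<CompletedPart> parts = new ArrayList<>();
        int partNumber = 1;
        for (String sourceKey : sourceKeys) {
            UploadPartCopyResponse resp = s3.uploadPartCopy(UploadPartCopyRequest.builder()
                    .sourceBucket(bucket).sourceKey(sourceKey)        // object copied from
                    .destinationBucket(bucket).destinationKey(destKey) // object being assembled
                    .uploadId(uploadId).partNumber(partNumber)
                    .build());
            parts.add(CompletedPart.builder()
                    .partNumber(partNumber)
                    .eTag(resp.copyPartResult().eTag())
                    .build());
            partNumber++;
        }

        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                .bucket(bucket).key(destKey).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
                .build());
    }
}
```

No data is downloaded or re-uploaded; S3 copies each source object into the destination as a part on the server side.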
Streaming files
Alternatively, rather than creating new objects in Amazon S3, your endpoint could simply read each file in turn and stream the contents back to the requester. This could be done via API Gateway and AWS Lambda. Your AWS Lambda code would read each object from S3 and keep returning the contents until the last object has been processed.
First, let me clarify your goal: you want to have an endpoint, say https://my.example.com/retrieve, that reads some set of files from S3 and combines them (say, as a ZIP)?
If yes, does whatever language/framework that you're using support chunked encoding for responses?
If yes, then it's certainly possible to do this without storing anything on disk: you read from one stream (the file coming from S3) and write to another (the response). I'm guessing you knew that already based on your comments to other answers.
However, based on your requirement of 1-10 GB of output, Lambda won't work because it has a limit of 6 MB for response payloads (and IIRC that's after Base64 encoding).
So in the AWS world, that leaves you with an always-running server, either EC2 or ECS/EKS.
Unless you're doing some additional transformation along the way, this isn't going to require a lot of CPU, but if you expect high traffic it will require a lot of network bandwidth. Which to me says that you want to have a relatively large number of smallish compute units. Keep a baseline number of them always running, and scale based on network bandwidth.
Unfortunately, smallish EC2 instances in general have lower bandwidth, although the a1 family seems to be an exception to this. And Fargate doesn't publish bandwidth specs.
That said, I'd probably run on ECS with Fargate due to its simpler deployment model.
Beware: your biggest cost with this architecture will almost certainly be data transfer. And if you use a NAT, not only will you be paying for its data transfer, you'll also limit your bandwidth. I would at least consider running in a public subnet (with assigned public IPs).
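To make the streaming idea concrete, here is a hedged sketch of an always-on server that zips S3 objects straight into a chunked HTTP response, using only the JDK's built-in HttpServer and the AWS SDK for Java v2. The bucket and key list are placeholders; a real service would derive them from the request:

```java
// Sketch: stream S3 objects into a ZIP response with chunked transfer encoding, no local storage.
import com.sun.net.httpserver.HttpServer;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class MergedZipServer {
    public static void main(String[] args) throws Exception {
        S3Client s3 = S3Client.create();
        String bucket = "my-source-bucket";                           // placeholder
        List<String> keys = List.of("part-0001.csv", "part-0002.csv"); // placeholder

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/retrieve", exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "application/zip");
            exchange.sendResponseHeaders(200, 0); // length 0 => chunked response
            try (OutputStream body = exchange.getResponseBody();
                 ZipOutputStream zip = new ZipOutputStream(body)) {
                for (String key : keys) {
                    zip.putNextEntry(new ZipEntry(key));
                    try (InputStream in = s3.getObject(
                            GetObjectRequest.builder().bucket(bucket).key(key).build())) {
                        in.transferTo(zip); // S3 -> response, one object at a time
                    }
                    zip.closeEntry();
                }
            }
        });
        server.start();
    }
}
```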

AWS lambda extract large data and upload to s3

I am trying to write a Node.js Lambda function to query data from our database cluster and upload it to S3; we require this for further analysis. But my doubt is: if the data to be queried from the database is large (9 GB), how does the Lambda function handle it, as the memory limit is 3008 MB?
There is also a disk storage limit of 500 MB.
Therefore, you would need to stream the result to Amazon S3 as it is coming in from the database.
You might also run into a time limit of 15 minutes for a Lambda function, depending upon how fast the database can query and transfer that quantity of information.
You might consider an alternative strategy, such as having the Lambda function call Amazon Athena to query the database. The results of an Athena query are automatically saved to Amazon S3, which would avoid the need to transfer the data.
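For illustration, a minimal sketch of that Athena route using the AWS SDK for Java v2 (the same idea applies from Node.js). The database, query, and output bucket are placeholders; Athena writes the result set to the output location itself, so the Lambda never holds the 9 GB:

```java
// Sketch: kick off an Athena query whose results land directly in S3.
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

public class AthenaExport {
    public static String startExport(AthenaClient athena) {
        return athena.startQueryExecution(StartQueryExecutionRequest.builder()
                        .queryString("SELECT * FROM my_table")                    // placeholder query
                        .queryExecutionContext(QueryExecutionContext.builder()
                                .database("my_database")                          // placeholder database
                                .build())
                        .resultConfiguration(ResultConfiguration.builder()
                                .outputLocation("s3://my-analysis-bucket/athena-results/") // placeholder
                                .build())
                        .build())
                .queryExecutionId(); // poll GetQueryExecution with this id to see when it finishes
    }
}
```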
Lambda has some limitations in terms of run time and space. It's better to use a crawler or a job in AWS Glue; that's the easy way of doing this.
For that, go to AWS Glue >> Jobs >> Create job, fill in the basic requirements like source and destination, and run the job. There are no such size and time constraints.

Accessing Large files stored in AWS s3 using AWS Lambda functions

I have a file of more than 30 GB stored in S3, and I want to write a Lambda function which will access that file, parse it, and then run some algorithm on it.
I am not sure if my Lambda function can take that big a file and work on it, as the max execution time for a Lambda function is 300 sec (5 min).
I found the Amazon S3 Transfer Acceleration feature, but will it help?
Considering this scenario, can anyone suggest a service other than a Lambda function to host my code as a microservice and parse the file?
Thanks in Advance
It is totally based on the processing requirements and frequency of processing.
You can use Amazon EMR for parsing the file and running the algorithm, and based on the requirement you can terminate the cluster or keep it alive for frequent processing. https://aws.amazon.com/emr/getting-started/
You can try the Amazon Athena service (recently launched), which will help you parse and process files stored in S3. The infrastructure will be taken care of by Amazon. http://docs.aws.amazon.com/athena/latest/ug/getting-started.html
For complex processing-flow requirements, you can use combinations of AWS services such as AWS Data Pipeline to manage the flow and AWS EMR or EC2 to run the processing task. https://aws.amazon.com/datapipeline/
Hope this helps, thanks

Is it possible to transfer data from Redshift to Elasticsearch?

I'm working on something related to the Amazon Elasticsearch Service. For that, I need to get data from Amazon Redshift. The data to be transferred is huge, i.e. 100 GB. Is there any way to get it directly from Redshift, or is it a two-step process like Redshift -> S3 -> Elasticsearch?
I see, at least in theory, two possible approaches for transferring data from Redshift to Elasticsearch:
Logstash, using the JDBC input plugin
elasticsearch-jdbc
Don't gzip the data you unload.
Use the bulk API to load into Elasticsearch.
Use a large number of records per bulk request (>5000); fewer large bulk loads are better than many smaller ones.
When working with the AWS Elasticsearch Service there is a risk of hitting the limits of the bulk queue size.
Process a single file in the Lambda and then recursively call the Lambda function with an event.
Before recursing, wait for a couple of seconds (setTimeout). When waiting, make sure that you aren't idle for 30 seconds, because your Lambda will stop.
Don't use S3 object creation to trigger your Lambda; you'll end up with multiple Lambda functions being called at the same time.
Don't bother trying to put Kinesis in the middle; unloading your data into Kinesis is almost certain to hit load limits in Kinesis.
Monitor your Elasticsearch bulk queue size with something like this:
curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool | jq '.nodes | to_entries[].value.thread_pool.bulk'
It looks like there is no direct data transfer pipeline for pushing data into Elasticsearch from Redshift. One alternative approach is to first dump the data into S3 and then push it into Elasticsearch.
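As a sketch of that first hop (Redshift to S3), an UNLOAD issued over JDBC would look roughly like this. The JDBC URL, credentials, IAM role ARN, bucket, and table are placeholders, the Redshift JDBC driver is assumed to be on the classpath, and the data is left uncompressed in line with the tips above:

```java
// Sketch: dump a Redshift table to S3 with UNLOAD, executed through JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RedshiftUnload {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "awsuser", "password"); // placeholders
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "UNLOAD ('SELECT * FROM my_table') "
                + "TO 's3://my-export-bucket/redshift/part_' "
                + "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role' "
                + "DELIMITER ',' ALLOWOVERWRITE"); // uncompressed, comma-delimited files in S3
        }
    }
}
```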