AWS lambda extract large data and upload to s3

AWS lambda extract large data and upload to s3 - amazon-web-services

I am trying to write a nodeJS lambda function to query data from our database cluster and upload this to s3, we require this for further analysis. But my doubt is, if the data to be queried from the db is large (9GB), how does the lambda function handle this as the memory limit is 3008 MB ?

There is also a disk storage limit of 500MB.
Therefore, you would need to stream the result to Amazon S3 as it is coming in from the database.
You might also run into a time limit of 15 minutes for a Lambda function, depending upon how fast the database can query and transfer that quantity of information.
You might consider an alternative strategy, such as having the Lambda function call Amazon Athena to query the database. The results of an Athena query are automatically saved to Amazon S3, which would avoid the need to transfer the data.

lambda have some limitations it term of run time and space . it's better to use crawler or job in amazon glue. it's the easy way of doing this.
for that go to `
amazon glue>>job>>create job
and fill basic requirements like source and destination.
and run job. there is no constrains for size and time limitation.
`

Related

Trigger a Custom Function Every X Hours in AWS

I am looking to trigger code every 1 Hour in AWS.
The code should: Parse through a list of zip codes, fetch data for each of the zip codes, store that data somewhere in AWS.
Is there a specific AWS service would I use for parsing through the list of zip codes and call the api for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS Service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found I could write serverless code in Lambda which made me think it would be the answer to my first question. Then I tried to look into how that could be ran every x time, but that's where I was struggling to know if I could still use Lambda for that. Then knowing where my options were to store the data. I saw that Glue may be an option, but wasn't sure.

Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every 1 hour.
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are
Amazon DynamoDB for key-value, noSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still do SQL-like queries on the data using Amazon Athena

How to query AWS load balancer log if there are terabytes of logs?

I want to query AWS load balancer log to automatically and on schedule send report for me.
I am using Amazon Athena and AWS Lambda to trigger Athena. I created data table based on guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html
However, I encounter following issues:
Logs bucket increases in size day by day. And I notice if Athena query need more than 5 minutes to return result, sometimes, it produce "unknown error"
Because the maximum timeout for AWS Lambda function is 15 minutes only. Therefore, I can not continue to increase Lambda function timeout to wait for Athena to return result (if in the case that Athena needs >15 minutes to return result, for example)
Can you guys suggest for me some better solution to solve my problem? I am thinking of using ELK stack but I have no experience in working with ELK, can you show me the advantages and disadvantages of ELK compared to the combo: AWS Lambda + AWS Athena? Thank you!

First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.
Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
With that out of the way, the problem with reading access logs from Athena is that it isn't easy to partition them. The example that AWS provides defines the table inside Athena, and the default partitioning scheme uses S3 paths that have segments /column=value/. However, ALB access logs use a simpler yyyy/mm/dd partitioning Scheme.
If you use AWS Glue, you can define a table format that uses this simpler scheme. I haven't done that so can't give you information other than what's in the docs.
Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:
Bucket_A is the destination for access logs, and the source for your Athena queries. It has a life-cycle policy that deletes logs after 30 (or 45, or whatever) days.
Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the cost in half.
Elasticsearch is certainly a popular option. You'll need to convert the files in order to upload it. I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.

Copy data from S3 and post process

There is a service that generates data in S3 bucket that is used for warehouse querying. Data is inserted into S3 as daily mechanism.
I am interested in copying that data from S3 to my service account to further classify the data. The classification needs to happen in my AWS service account as it is based on information present in my service account. Classification needs to happens in my service account as it is specific to my team/service. The service generating the data in S3 is neither concerned about the classification nor has the data to make classification decision.
Each S3 file consists of json objects (record) in it. For every record, I need to look into a dynamodb table. Based on whether data exists in Dynamo table, I need to include an additional attribute to the json object and store the list into another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CW event periodically to invoke a Lambda that will copy the files from Source S3 bucket into a bucket (lets say Bucket A) in my account.
Then, use another scheduled CW event to invoke a Lambda to read the records in the json and compare with dynamodb table to determine classification and write to updated record to another bucket (lets say Bucket B).
I have few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around lambdas getting timed out due large amount of data, especially for the second lambda that needs to compare against DDB for every record.

Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed-up the function if it is taking longer than 15 minutes.
There is a maximum of 512MB of storage space made available for a Lambda function.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.

AWS data pipeline: dump data to 3 s3 nodes

I have a use case wherein I want to take a data from DynamoDB and do some transformation on the data. After this I want to create 3 csv files (there will be 3 transformations on the same data) and dump them to 3 different s3 locations.
My architecture would be sort of following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using aws lamda. But according to my understanding, it's event based triggered rather time based scheduling, is that correct?

Yes it is possible but not using HiveActivity instead EMRActivity. If you look into Data pipeline documentation for HiveActivity, it clearly states its purpose and not suits your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look like. There is also a inbuilt template Export DynamoDB table to S3 in UI for AWS Data Pipeline which creates the basic structure for you, and then you can extend/customize to suit your requirements.
To your next question using Lambda, Of course lambda can be configured to have event based triggering or schedule based triggering, but I wouldn't recommend using AWS Lambda for any ETL operations as they are time bound & usual ETLs are longer than lambda time limits.
AWS has specific optimized feature offerings for ETLs, AWS Data Pipeline & AWS Glue, I would always recommend to choose between one of two. In case your ETL involves data sources not managed within AWS compute and storage services OR any speciality use case which can't be sufficed by above two options, then AWS Batch will be my next consideration.

Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. Turns out we can dump the data to different s3 locations using Hive activity as well.
This is how the data pipeline would like in that case.
But I believe writing multiple hive activities, when your input source is DynamoDB table, is not a good idea since hive doesn't load any data in memory. It does all the computations on the actual table which could deteriorate the performance of the table. Even documentation suggests to export the data incase you need to make multiple queries to same data. Reference
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different type of aggregations on the same data once a day. Since dynamoDB doesn't support aggregations, I turned to Data pipeline using Hive. In the end we ended up using AWS Aurora which is My-SQL based.

Is it possible to transfer data from Redshift to Elasticsearch?

I'm working on something related to Amazon elasticsearch service.For that,I need to get data from Amazon Redshift.The data to be tranfered is huge i.e. 100 GB.Is there any way to get it directly form Redshift or is it a two step process like Redshift->s3->elasticsearch?

I see, at least in theory, 2 possible approaches for transfering data from Redshift to Elasticsearch:
Logstash, using the JDBC input plugin
elasticsearch-jdbc

Don’t gzip the data unloaded.
Use the bulk load on elastic
Use a large number of records in the bulk load (>5000) – fewer large bulk
loads are better than more smaller ones.
When working with AWS elastic search there is a risk of hitting the limits of the bulk queue size.
Process a single file in the lambda and then recursively call the lambda function with an event
Before recursing wait for a couple of seconds –> setTimeout. When waiting make sure that you aren’t idle for 30 seconds because your lambda will stop.
Don’t use s3 object creation to trigger your lambda — you’ll end up with
multiple lambda functions being called at the same time.
Don’t bother trying to put kinesis in the middle – unloading your data
into kinesis is almost certain to hit load limits in kinesis.
Monitor your elastic search bulk queue size with something like
this:
curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool |jq
‘.nodes |to_entries[].value.thread_pool.bulk’

It looks like there is no direct data transfer pipeline for pushing data into elasticsearch from Redshift. One alternative approach is to first dump the data in S3 and then push into elasticsearch.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js