Processing huge csv file from aws s3 to database - amazon-web-services

I have a CSV file of about 2M records that gets uploaded to AWS S3 once or twice a day. I need to load this file into our database, which can handle roughly ~1K records at a time, or ~40-50k/min using batch upload.
I was planning to use AWS Lambda, but since it has a 15-minute timeout I would only be able to insert ~0.7M records. I have also read that a Lambda can invoke another Lambda with a new offset, but I would prefer to process the file in one go.
What would be the ideal approach for such a scenario? Should I spin up an EC2 instance to handle the batch uploads?
Any help would be appreciated.

Consider using Database Migration Service.
You can migrate data from an Amazon S3 bucket to a database using AWS DMS. The source data files must be in comma-separated value (.csv) format.
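If you go this route, DMS treats the S3 bucket as a source endpoint and needs an external table definition describing the CSV layout. Below is a rough boto3 sketch; the bucket name, IAM role ARN, and table/column layout are placeholders, and the exact external-table-definition fields should be checked against the DMS documentation.

```python
import json

import boto3

dms = boto3.client("dms")

# Describe the CSV layout for DMS (placeholder table and columns; adjust to your file).
external_table_definition = {
    "TableCount": "1",
    "Tables": [{
        "TableName": "records",
        "TablePath": "myschema/records/",
        "TableOwner": "myschema",
        "TableColumns": [
            {"ColumnName": "id", "ColumnType": "INT8", "ColumnNullable": "false", "ColumnIsPk": "true"},
            {"ColumnName": "payload", "ColumnType": "STRING", "ColumnLength": "255"},
        ],
        "TableColumnsTotal": "2",
    }],
}

# Register the S3 bucket as a DMS source endpoint; a target endpoint for the
# database and a replication task would be created in the same way.
dms.create_endpoint(
    EndpointIdentifier="csv-source",                      # placeholder endpoint name
    EndpointType="source",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",  # placeholder
        "BucketName": "my-upload-bucket",                 # placeholder
        "ExternalTableDefinition": json.dumps(external_table_definition),
    },
)
```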

Why don't you have one lambda running through the file and inserting the records into SQS?
Pretty sure this takes less than 15 minutes. A second Lambda then consumes the records from SQS and inserts them into the database. This way you don't risk overloading your database, since the consumer Lambda only pulls a small batch from the queue at a time (10 records per invocation by default).
Of course this is one solution of many.
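A minimal sketch of the producer side of that idea, assuming the Lambda is triggered by the S3 upload event and the queue URL comes from an environment variable (both assumptions); it streams the object line by line and sends messages to SQS in batches of 10:

```python
import os

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # placeholder: queue the second Lambda consumes from


def handler(event, context):
    # Triggered by the S3 upload event; read bucket and key from the event record.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"]

    batch = []
    for line in body.iter_lines():               # stream the file instead of reading it all
        batch.append({"Id": str(len(batch)), "MessageBody": line.decode("utf-8")})
        if len(batch) == 10:                     # SQS accepts at most 10 messages per batch call
            sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
            batch = []
    if batch:                                    # flush the final partial batch
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
```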

Related

Loading files from an AWS S3 bucket into Snowflake tables

I want to copy files from an S3 bucket into Snowflake. To do this I'm using a Lambda function. In the S3 bucket I have folders, and in every folder there are many CSV files. These CSV files can be small or huge. I have created a Lambda function that loads these files into Snowflake. The problem is that a Lambda function can run for only 15 minutes, which is not enough to load all the files into Snowflake. Can you help me with this problem? One solution I have is to execute the Lambda with only one file at a time, not with all the files.
As you said, the maximum execution time for a Lambda function is 15 minutes, and it is not a good idea to load the whole file into memory, because you will pay for the long execution time and the high memory usage.
But if you really want to use Lambda and you are dealing with files over 1 GB, perhaps you should consider AWS Athena, or optimize your Lambda function to read the file as a stream instead of loading it entirely into memory.
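As an illustration of the streaming approach, here is a small sketch that parses CSV rows straight off the S3 response stream without buffering the whole object (the bucket and key are placeholders):

```python
import codecs
import csv

import boto3

s3 = boto3.client("s3")


def iter_csv_rows(bucket: str, key: str):
    """Yield parsed CSV rows from S3 without loading the whole object into memory."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]   # botocore StreamingBody
    text_stream = codecs.getreader("utf-8")(body)          # decode the byte stream lazily
    for row in csv.reader(text_stream):
        yield row


# Example usage with placeholder names: count rows while keeping memory flat.
row_count = sum(1 for _ in iter_csv_rows("my-bucket", "exports/big-file.csv"))
```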
Another option is to publish an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process files as needed. For more information, see Running Cost-effective queue workers with Amazon SQS and Amazon EC2 Spot Instances.
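A minimal sketch of such a queue worker, assuming each message body carries the S3 key of a newly landed file (a hypothetical payload format) and that the queue URL shown is a placeholder:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/csv-files"  # placeholder


def process_file(s3_key: str) -> None:
    # Placeholder for the actual load into the target database/warehouse.
    print(f"processing {s3_key}")


def poll_forever() -> None:
    while True:
        # Long-poll for up to 10 messages at a time.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            process_file(msg["Body"])
            # Delete only after successful processing so failures are retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    poll_forever()
```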
The best option, though, would be to automate Snowpipe with AWS Lambda; for that, see the Snowpipe docs: Automating Snowpipe with AWS Lambda.

How to handle Amazon S3 updates for transactional data

I have a transactional table that I load from SQL Server to Amazon S3 using AWS DMS. To handle updates, I move the old files to an archive and then process only the incremental records each time.
This works fine when there are only insert operations in the database, but the problem comes when we need to accommodate updates. Right now, to apply any update we read the entire S3 file and modify the records that changed as part of the incremental load. As the data keeps growing, reading the entire file from the S3 bucket and updating it will take longer, and eventually the job might not finish in time (the job needs to complete within 1 hour).
This can be handled using Databricks, where we can use a Delta table to update the records and finally overwrite the existing file, but Databricks is a bit expensive.
How do we handle the same using AWS Glue?
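For reference, the full-file merge described above looks roughly like the PySpark below when run inside a Glue job; the `id` key, the `last_updated` column, and the S3 paths are all hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("merge-incremental").getOrCreate()

# Placeholder paths: the current full snapshot and the latest incremental extract.
full_df = spark.read.csv("s3://my-bucket/full/", header=True)
incr_df = spark.read.csv("s3://my-bucket/incremental/", header=True)

# Union old and new rows, then keep only the most recent version of each key.
combined = full_df.unionByName(incr_df)
latest = (
    combined
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("id").orderBy(F.col("last_updated").desc())
        ),
    )
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Overwrite the snapshot; write to a staging prefix first if overwriting in place is unsafe.
latest.write.mode("overwrite").csv("s3://my-bucket/full_new/", header=True)
```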

Copy records (row by row) from S3 to SQS - using AWS Batch

I have set up a Glue job that runs concurrently to process input files and writes the output to S3. The Glue job runs periodically (it is not a one-time job).
The output in S3 is in the form of CSV files. The requirement is to copy all of those records into AWS SQS, assuming there might be hundreds of files, each containing up to a million records.
Initially I was planning to have a Lambda event send the records row by row; however, from the docs I see that Lambda has a time limit of 15 minutes: https://aws.amazon.com/about-aws/whats-new/2018/10/aws-lambda-supports-functions-that-can-run-up-to-15-minutes/#:~:text=You%20can%20now%20configure%20your,Lambda%20function%20was%205%20minutes.
Would it be better to use AWS Batch for copying the records from S3 to SQS? I believe AWS Batch can scale the process when needed and also perform the task in parallel.
I want to know whether AWS Batch is the right pick, or whether I am overcomplicating the design.
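If AWS Batch does turn out to be the right fit, one common pattern is to submit one Batch job per output file (for example from a small Lambda reacting to the Glue job's output), so files are processed in parallel. A hedged boto3 sketch; the job queue, job definition, and environment variable names are all placeholders:

```python
import boto3

batch = boto3.client("batch")


def submit_copy_job(bucket: str, key: str) -> str:
    """Submit one AWS Batch job to copy a single CSV file's records into SQS."""
    response = batch.submit_job(
        jobName=f"csv-to-sqs-{key.replace('/', '-')}",
        jobQueue="csv-to-sqs-queue",              # placeholder job queue
        jobDefinition="csv-to-sqs-worker",        # placeholder job definition
        containerOverrides={
            # The container entrypoint would read these and stream the file to SQS.
            "environment": [
                {"name": "SOURCE_BUCKET", "value": bucket},
                {"name": "SOURCE_KEY", "value": key},
            ]
        },
    )
    return response["jobId"]
```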

AWS lambda extract large data and upload to s3

I am trying to write a Node.js Lambda function to query data from our database cluster and upload it to S3; we require this for further analysis. My doubt is: if the data to be queried from the DB is large (9 GB), how does the Lambda function handle it, given that the memory limit is 3008 MB?
There is also a disk storage limit of 500MB.
Therefore, you would need to stream the result to Amazon S3 as it is coming in from the database.
You might also run into a time limit of 15 minutes for a Lambda function, depending upon how fast the database can query and transfer that quantity of information.
You might consider an alternative strategy, such as having the Lambda function call Amazon Athena to query the database. The results of an Athena query are automatically saved to Amazon S3, which would avoid the need to transfer the data.
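If you do go the streaming route, here is a rough sketch of the idea in Python (the question uses Node.js, but the shape is the same): read rows through a server-side cursor and push them to S3 as multipart-upload parts, so neither memory nor local disk ever holds the full 9 GB. PostgreSQL/psycopg2, the query, and all names here are assumptions.

```python
import boto3
import psycopg2  # assumption: a PostgreSQL cluster; any driver with a server-side cursor works

s3 = boto3.client("s3")
BUCKET, KEY = "analysis-bucket", "exports/dump.csv"   # placeholders
PART_SIZE = 8 * 1024 * 1024                           # each part must be >= 5 MB (except the last)

conn = psycopg2.connect(host="db.example.internal", dbname="app", user="reader", password="...")
cur = conn.cursor(name="export_cursor")               # server-side cursor: rows arrive in batches
cur.execute("SELECT * FROM big_table")                # placeholder query

upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts, part_number, buffer = [], 1, bytearray()

for row in cur:
    buffer += (",".join(map(str, row)) + "\n").encode("utf-8")
    if len(buffer) >= PART_SIZE:
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_number,
                              UploadId=upload["UploadId"], Body=bytes(buffer))
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number, buffer = part_number + 1, bytearray()

if buffer:  # flush the final (possibly small) part
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_number,
                          UploadId=upload["UploadId"], Body=bytes(buffer))
    parts.append({"ETag": resp["ETag"], "PartNumber": part_number})

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})
cur.close()
conn.close()
```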
Lambda has some limitations in terms of run time and storage, so it is better to use a crawler or a job in AWS Glue; that is the easier way of doing this.
To do that, go to AWS Glue > Jobs > Create job, fill in the basic requirements such as source and destination, and run the job. Glue does not impose the same size and run-time constraints.

Approaches for migrating .csv files stored in S3 to DynamoDB?

We have hundreds of thousands of .csv files stored in S3, each containing at least several data records (each record is its own row).
I am trying to design a migration strategy to transform all the records in the .csv files and put them into DynamoDB. During the migration, I'd also like to ensure that if any new .csv gets added to the S3 bucket, we automatically trigger a Lambda or something similar to do the transformation and write to DynamoDB as well.
Eventually we'd stop writing to S3 entirely, but initially we need to keep those writes, and any write to S3 should also trigger a write to DynamoDB. Does anyone know of any good strategies for doing this? (Is there something like DynamoDB Streams, but for S3?) And are there any strategies for getting the existing .csv data in S3 over to DynamoDB in general?
AWS has many tools you can use to solve this problem. Here are a few.
You could use AWS Database Migration Service. It supports migrating data from S3 and into DynamoDB. This AWS product is designed specifically for your use case, and it handles pretty much everything.
Once the migration has started, DMS manages all the complexities of the migration process including automatically replicating data changes that occur in the source database during the migration process.
S3 can publish events to trigger a lambda function which can be used to continuously replicate the data to DynamoDB.
AWS Data Pipeline basically does batch ETL jobs, which could move your data from S3 to DynamoDB all at once. You might also be able to run periodic sync jobs if you can tolerate a delay in replicating data to DynamoDB.
AWS Glue can crawl your data, process it, and store it in another location. I think it would provide you with an initial load plus the ongoing replication. While it could work, it’s designed more for unstructured data, and you have CSV files which are usually structured.
I’d recommend using AWS Database Migration Service because it’s the one-stop solution, but if you can’t use it for some reason, there are other options.
I don't know whether DynamoDB has a "load records from CSV" feature (Redshift does).
If it does not, then you could roll your own. Write a Python function that imports the csv and boto3 modules and takes an S3 path as input (inside an event dictionary). The function would then download the file from S3 to a temp dir, parse it with csv, and use boto3 to insert the records into DynamoDB.
To get the history loaded, write a function that uses `boto3` to list the objects in S3 and then calls the first function for each object to load it into DynamoDB.
To get future files loaded, install the first function as a Lambda function, and add a trigger from S3 Object Creation events to run the function whenever a new object is put onto S3.
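A minimal sketch of that pair of functions, assuming the CSV headers match the table's attribute names (including its key attributes); the bucket and table names are placeholders:

```python
import csv
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-table")   # placeholder table name


def load_csv_to_dynamodb(bucket: str, key: str) -> None:
    """Download one CSV object to /tmp, parse it, and batch-write its rows to DynamoDB."""
    local_path = os.path.join("/tmp", os.path.basename(key))
    s3.download_file(bucket, key, local_path)
    with open(local_path, newline="") as f:
        reader = csv.DictReader(f)                     # assumes headers match attribute names
        with table.batch_writer() as writer:           # batches and retries the PutItem calls
            for row in reader:
                writer.put_item(Item=row)


def handler(event, context):
    """Lambda entry point for S3 ObjectCreated events (future files)."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        load_csv_to_dynamodb(bucket, key)


def backfill(bucket: str, prefix: str = "") -> None:
    """One-off history load: list every object and reuse the same loader per file."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".csv"):
                load_csv_to_dynamodb(bucket, obj["Key"])
```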