Loading files from an AWS S3 bucket to Snowflake tables - amazon-web-services

I want to copy files from an S3 bucket into Snowflake, and to do this I'm using a Lambda function. The S3 bucket contains folders, and every folder holds many CSV files, some small and some huge. I have created a Lambda function that loads these files into Snowflake. The problem is that a Lambda function can only run for 15 minutes, which is not enough time to load all the files. Can you help me with this problem? One solution I have is to execute the Lambda with only one file instead of all the files.

As you said, the maximum execution time for a Lambda function is 15 minutes, and it is not a good idea to load all of the files into memory, because you will pay for long execution times and high memory usage.
But if you really want to use Lambdas and you are dealing with files over 1 GB, perhaps you should consider AWS Athena, or optimize your AWS Lambda function to read the file using a stream instead of loading the whole file into memory.
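As a rough illustration of that streaming approach (the handler and bucket/key handling below are assumptions, not the original poster's code), a Lambda handler can iterate over the object body line by line with boto3 instead of reading the whole file at once:

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Bucket and key come from the S3 event notification that triggered the function
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]

        # StreamingBody lets us read the CSV line by line instead of
        # loading a possibly multi-GB file into Lambda memory
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        for line in body.iter_lines():
            row = line.decode("utf-8")
            # ... parse/buffer the row here ...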
Another option would be to create an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process it as necessary. For more information, see Running Cost-effective Queue Workers with Amazon SQS and Amazon EC2 Spot Instances.
The best option would be to automate Snowpipe with AWS Lambda; for this, check the Snowpipe docs: Automating Snowpipe with AWS Lambda.
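If you instead follow the per-file idea from the question (one Lambda invocation per file), a minimal sketch using the snowflake-connector-python package could look like this; the stage, table name and environment variables are placeholders, not anything from the original post:

    import os
    import snowflake.connector

    def handler(event, context):
        # Key of the single file that triggered this invocation
        key = event["Records"][0]["s3"]["object"]["key"]

        conn = snowflake.connector.connect(
            account=os.environ["SF_ACCOUNT"],
            user=os.environ["SF_USER"],
            password=os.environ["SF_PASSWORD"],
            warehouse=os.environ["SF_WAREHOUSE"],
            database=os.environ["SF_DATABASE"],
            schema=os.environ["SF_SCHEMA"],
        )
        try:
            # MY_S3_STAGE is an external stage pointing at the bucket (placeholder name)
            conn.cursor().execute(
                f"COPY INTO my_table FROM @MY_S3_STAGE/{key} "
                "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
            )
        finally:
            conn.close()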

Related

AWS: Breaking a large file in S3 into small chunks of files based on the number of rows present in the file

We will be receiving a daily file from a consumer which is 45 MB in size.
We have a requirement to break this 45 MB file into small chunks of files based on a configurable number of rows.
Is there any AWS service available on top of S3 which can do this work?
This process should be automated and require no manual intervention.
We need to achieve it using Java (preference); any other language is also fine.
There is no "split my file into small chunks" service in AWS. You would need compute to perform this operation, such as an Amazon EC2 instance, an AWS Lambda function, or an AWS Fargate container.
If "automated one and no manual intervention" means "do this when the file is uploaded" then the most appropriate would be an AWS Lambda function since S3 can trigger the function upon upload.
The Lambda function will be provided with the Bucket and Key of the S3 object that triggered the function. You will need to write the code that downloads the file, splits it into multiple files and uploads them to S3. You might be able to do it in-memory depending on the file size. There are plenty of examples online of how to use S3 from an AWS Lambda function.
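The question prefers Java, but as a rough language-agnostic illustration, a Python sketch of that Lambda could look like the following; the chunk size, output prefix and handler name are assumptions:

    import csv
    import io
    import boto3

    s3 = boto3.client("s3")
    ROWS_PER_CHUNK = 10000  # configurable

    def handler(event, context):
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]

        # A 45 MB file fits comfortably in Lambda memory, so read it all at once
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        reader = csv.reader(io.StringIO(text))
        header = next(reader)

        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == ROWS_PER_CHUNK:
                upload_chunk(bucket, key, header, chunk, part)
                chunk, part = [], part + 1
        if chunk:
            upload_chunk(bucket, key, header, chunk, part)

    def upload_chunk(bucket, key, header, rows, part):
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
        s3.put_object(
            Bucket=bucket,
            Key=f"chunks/{key}.part{part:04d}.csv",  # assumed output naming
            Body=out.getvalue().encode("utf-8"),
        )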

AWS Lambda: Will Lambda incur cost if it is waiting for File event of Amazon S3

I have an Amazon S3 bucket to which I am loading files. After loading the files, I create a specific file whose arrival triggers the Lambda function. So while the Lambda function is waiting for that file to materialize in Amazon S3, will it incur cost? I am just starting out with AWS, so this question might seem naïve, but please do help.

Python Script as a Cron on AWS S3 buckets

I have a Python script which copies files from one S3 bucket to another S3 bucket. This script needs to run every Sunday at a specific time. I was reading some articles and answers, so I tried to use AWS Lambda + CloudWatch Events. The script runs for a minimum of 30 minutes. Would that still work with Lambda, given that Lambda can run for a maximum of 15 minutes only? Or is there any other way? I could create an EC2 box and run it as a cron job, but that would be expensive. Is there any other standard way?
The more appropriate way would be to use an AWS Glue Python shell job, as it falls under the serverless umbrella and you are charged as you go.
This way you will only be charged for the time your code runs.
Also, you don't need to manage any EC2 instances for this. It is like an extended Lambda.
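For reference, the copy logic itself is small. Here is a sketch of what such a Glue Python shell job (or any scheduled script) could run, with the bucket names as placeholders:

    import boto3

    SOURCE_BUCKET = "source-bucket-name"  # placeholder
    DEST_BUCKET = "dest-bucket-name"      # placeholder

    def copy_all_objects():
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SOURCE_BUCKET):
            for obj in page.get("Contents", []):
                # Server-side copy: the data never leaves S3
                s3.copy_object(
                    Bucket=DEST_BUCKET,
                    Key=obj["Key"],
                    CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
                )

    if __name__ == "__main__":
        copy_all_objects()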
If the two buckets are supposed to stay in sync, i.e. all files from bucket #1 should eventually be synced to bucket #2, then there are various replication options in S3.
Otherwise look at S3 Batch Operations. You can derive the list of files that you need to copy from S3 Inventory which will give you additional context on the files, such as date/time uploaded, size, storage class etc.
Unfortunately, the Lambda 15-minute execution limit is a hard stop, so Lambda is not suitable for this use case as a big-bang move.
You could use multiple Lambda invocations to go through the objects one at a time and move them. However, you would need a DynamoDB table (or something similar) to keep track of what has been moved and what has not; see the sketch after the list below.
Another couple of options would be:
S3 Replication which will keep one bucket in sync with the other.
An S3 Batch operation
Or, if they are data files, you can always use AWS Glue.
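To make the incremental-Lambda idea above concrete, here is a rough sketch; the DynamoDB table, its key attribute, the batch size and the bucket names are all assumptions, not anything specified in the answer:

    import boto3

    s3 = boto3.client("s3")
    tracking = boto3.resource("dynamodb").Table("s3-copy-progress")  # assumed table, hash key "key"

    SOURCE_BUCKET = "source-bucket-name"  # placeholder
    DEST_BUCKET = "dest-bucket-name"      # placeholder
    BATCH_SIZE = 500                      # keep each invocation well under 15 minutes

    def handler(event, context):
        copied = 0
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SOURCE_BUCKET):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                # Skip anything the tracking table says has already been copied
                if "Item" in tracking.get_item(Key={"key": key}):
                    continue
                s3.copy_object(
                    Bucket=DEST_BUCKET,
                    Key=key,
                    CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
                )
                tracking.put_item(Item={"key": key})
                copied += 1
                if copied >= BATCH_SIZE:
                    return {"status": "partial", "copied": copied}
        return {"status": "complete", "copied": copied}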
You can certainly use Amazon EC2 for a long-running batch job.
A t3.micro Linux instance costs $0.0104 per hour, and a t3.nano is half that price, charged per-second.
Just add a command at the end of the User Data script that will shut down the instance:
sudo shutdown -h now
If you launch the instance with Shutdown Behavior = Terminate, then the instance will self-terminate.
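As a hedged illustration of that launch setting using boto3 (the AMI ID, instance type and User Data contents are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # The User Data script runs the batch job and shuts the instance down at the end;
    # with InstanceInitiatedShutdownBehavior="terminate" that shutdown terminates the instance.
    user_data = (
        "#!/bin/bash\n"
        "python3 /home/ec2-user/copy_job.py  # placeholder for the actual batch job\n"
        "shutdown -h now\n"
    )

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
        InstanceInitiatedShutdownBehavior="terminate",
    )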

What is the most optimal way to automate data (csv file) transfer from s3 to Redshift without AWS Pipeline?

I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer that data to AWS Redshift, and automate the process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also tried the AWS documentation for Lambda and Glue but couldn't find the exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (e.g. psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
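A minimal sketch of such a function, assuming psycopg2 is bundled in the deployment package or a Lambda layer; the table, S3 path, IAM role ARN and environment variables are placeholders:

    import os
    import psycopg2

    # All identifiers in the COPY statement below are placeholders
    COPY_SQL = """
        COPY my_table
        FROM 's3://my-bucket/incoming/data.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """

    def handler(event, context):
        conn = psycopg2.connect(
            host=os.environ["REDSHIFT_HOST"],
            port=5439,
            dbname=os.environ["REDSHIFT_DB"],
            user=os.environ["REDSHIFT_USER"],
            password=os.environ["REDSHIFT_PASSWORD"],
        )
        try:
            with conn.cursor() as cur:
                cur.execute(COPY_SQL)
            conn.commit()
        finally:
            conn.close()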
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple, Python-based christianhxc/aws-lambda-redshift-copy: an AWS Lambda function that runs the COPY command into Redshift
A more fully-featured, Node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog

Accessing Large files stored in AWS s3 using AWS Lambda functions

I have a file of more than 30 GB stored in S3, and I want to write a Lambda function which will access that file, parse it and then run some algorithm on it.
I am not sure if my Lambda function can take that big a file and work on it, as the max execution time for a Lambda function is 300 sec (5 min).
I found the AWS S3 Transfer Acceleration feature, but will it help?
Considering the scenario, can anyone suggest any service other than a Lambda function to host my code as a microservice and parse the file?
Thanks in advance.
It totally depends on the processing requirements and the frequency of processing.
You can use Amazon EMR to parse the file and run the algorithm; based on the requirement, you can terminate the cluster or keep it alive for frequent processing. https://aws.amazon.com/emr/getting-started/
You can try the Amazon Athena service (recently launched), which will help you parse and process files stored in S3; the infrastructure is managed by Amazon (a short sketch follows at the end of this answer). http://docs.aws.amazon.com/athena/latest/ug/getting-started.html
For complex processing-flow requirements, you can use combinations of AWS services such as AWS Data Pipeline (to manage the flow) and AWS EMR or EC2 (to run the processing tasks). https://aws.amazon.com/datapipeline/
Hope this helps, thanks.
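For the Athena option, a hedged sketch of kicking off a query over CSV data in S3 with boto3; the database, table and output location are assumptions, and an external table must already be defined over the S3 prefix:

    import boto3

    athena = boto3.client("athena")

    def run_query():
        # Assumes an external table "my_table" has been created over the S3 data
        response = athena.start_query_execution(
            QueryString="SELECT col_a, COUNT(*) FROM my_table GROUP BY col_a",
            QueryExecutionContext={"Database": "my_database"},
            ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/athena/"},
        )
        return response["QueryExecutionId"]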