I'd like to create a Lambda function that COPYs into Redshift the data from a file PUT into a specific S3 bucket, but I can't figure out how to do that.
So far, I have created a Lambda function that triggers whenever a .csv file is PUT into the S3 bucket, and I have managed to COPY the data from a local .csv file into Redshift.
Now I'd like some help with how to COPY the data using a Lambda function. I searched the internet, but couldn't find proper examples using Lambda.
I use PowerShell to export data from SQL Server and upload it to an S3 bucket. I then have a Lambda function with an S3 PUT trigger that executes a stored procedure in Redshift; the procedure contains the COPY command in a dynamic statement so that it can load data into different tables. The Lambda triggers every time a file is uploaded, so it loads multiple tables multiple times.
The solution approach:
The Lambda trigger fires whenever a file is placed in S3.
The data pipeline is automated as follows:
Flat file -> S3 -> DynamoDB (stores the COPY command) -> Lambda executes the script stored in DynamoDB -> data loaded into Redshift
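A minimal sketch of such a Lambda, under the assumption that a DynamoDB table (here called copy_commands, a made-up name) stores the COPY script per file prefix and that the Redshift Data API is used to run it; the cluster, database, and user names are placeholders:

```python
import boto3
import urllib.parse

dynamodb = boto3.resource("dynamodb")
redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Look up the COPY command stored in DynamoDB for this file's prefix
    # (assumes the stored SQL contains {bucket} and {key} placeholders)
    table = dynamodb.Table("copy_commands")
    item = table.get_item(Key={"file_prefix": key.split("/")[0]})["Item"]
    copy_sql = item["copy_sql"].format(bucket=bucket, key=key)

    # Run the COPY against Redshift; the Data API call is asynchronous,
    # so poll describe_statement if you need to wait for completion
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="my_db",
        DbUser="my_user",
        Sql=copy_sql,
    )
```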
I am new to AWS Lambda development. My Lambda reads a parquet file from an S3 path and moves the data into DynamoDB.
I have a use case where I will have to trigger the same Lambda function to read new files uploaded to multiple S3 paths. I am not sure how we can pass the file path into the Lambda call. For example:
Lambda1 Call 1 for reading parquet from file_path1 - new parquet uploaded in path
file_path1 - /abc/abc1/abc_1.parquet
Lambda1 Call 2 for reading parquet from file_path2 - new parquet uploaded in path
file_path2 - /xyz/xyz1/xyz_1.parquet
I am not sure if I have to provision multiple Lambdas or can use one single Lambda that is called multiple times with different file paths.
You can use a single Lambda function for this; there is no reason to provision multiple copies of the function. You can probably just set up a single trigger on the S3 bucket that fires the Lambda function for any new file. The path of the new file will be in the event parameter that your Lambda function handler receives.
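A minimal sketch of that handler; the only real work it does is pull the bucket and key out of the S3 event record, and the downstream processing is left as a placeholder:

```python
import urllib.parse

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event (spaces become '+', etc.)
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"New object: s3://{bucket}/{key}")
        # ...read the parquet file at this path and write to DynamoDB...
```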
I have successfully set up DMS to copy data from RDS (SQL Server) to S3 in csv format (full load). However, upon running the task, DMS copies the source table and creates multiple csv files in S3 for the single table. Is there any way to make sure that for one table, DMS creates only one target csv file in S3?
The first full load operation will load all data into one file.
For ongoing replication, the migrated data has a different format: each record contains an additional operation character:
I: for inserted records
U: for updated records
D: for deleted records
So they cannot simply be merged into one file.
You can do this using Lambda, but it's not a good approach:
Add a trigger so the Lambda function runs whenever a data change lands in the above S3 bucket, which contains the csv files.
In the Lambda function, handle the file for each of the cases above and merge the records yourself.
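As a rough illustration only, here is a sketch of merging DMS CDC rows into a single current-state view; it assumes the first csv column is the operation flag and the second is the primary key, which you would adapt to your actual layout:

```python
import csv

def merge_cdc(csv_path, current_rows):
    """current_rows: dict mapping primary key -> latest row (list of columns)."""
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            op, pk = row[0], row[1]
            if op in ("I", "U"):
                current_rows[pk] = row[1:]   # insert or overwrite the row
            elif op == "D":
                current_rows.pop(pk, None)   # drop deleted rows
    return current_rows
```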
I suggest using another DB target such as MySQL or Postgres instead, as they support all of these operations.
When a file is uploaded to an S3 bucket, a Lambda function must be triggered.
Athena should then run a query based on the file name and pass the results back to the Lambda execution.
It appears that your goal is to:
Upload a file to Amazon S3
This file should trigger an AWS Lambda function
The Lambda function should execute a query on Amazon Athena, using information from the filename of the file that was uploaded
It should then return the results of the query to somewhere
This is possible, yes. However, when the Lambda function receives the results from Amazon Athena, it cannot "pass back" the results, since it was triggered by Amazon S3 and there is nowhere to "pass back" those results. Instead, you might want to store the results in another file in S3.
You would need to write the Lambda function. It can call the Amazon Athena API to run a query and store the results in a new file.
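A minimal sketch of such a function, assuming a hypothetical Athena database my_db and results bucket my-athena-results; the query built from the filename is purely illustrative:

```python
import urllib.parse
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    record = event["Records"][0]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Derive something query-able from the uploaded file's name (illustrative)
    table_name = key.split("/")[-1].split(".")[0]
    query = f"SELECT * FROM {table_name} LIMIT 10"

    # Athena writes the query results to the S3 location given here
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/output/"},
    )
    print("Started query:", response["QueryExecutionId"])
```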
I would like to find out how I can have sample files in my S3 buckets that can be processed by my Lambda functions, which then dump the data into Redshift.
I know we can load data from S3 to Redshift using the COPY command from the following aws doc: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html
What would be the process for loading files from S3 into Redshift after they have been processed by Lambda functions?
Configure the S3 bucket to trigger your Lambda function when new files are uploaded.
The Lambda function can copy the file from S3 to the Lambda environment's /tmp folder and then perform whatever processing is needed.
Once the processing is complete, if you want to perform a Redshift COPY command, then the Lambda function would need to first copy the new file to a different location in S3, perhaps a completely different bucket, and then issue the COPY command to the Redshift cluster. Alternatively the Lambda function could open a connection to the Redshift cluster and issue INSERT statements directly.
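A sketch of that flow, assuming the Redshift Data API is used to issue the COPY; the staging bucket, cluster, table, and IAM role names are placeholders:

```python
import boto3

s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

def load_processed_file(local_path, staging_bucket, staging_key):
    # Upload the processed file from /tmp to a staging location that does
    # NOT trigger this Lambda again
    s3.upload_file(local_path, staging_bucket, staging_key)

    copy_sql = (
        "COPY my_schema.my_table "
        f"FROM 's3://{staging_bucket}/{staging_key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )
    # execute_statement is asynchronous; poll describe_statement for status
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="my_db",
        DbUser="my_user",
        Sql=copy_sql,
    )
```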
If you want to decouple the process further, you could have the Lambda function simply copy the final output to another S3 bucket and quit. Then have the second S3 bucket trigger a second Lambda function that issues the COPY command to Redshift.
I have csv log data arriving every hour in a single S3 bucket, and I want to partition it to improve query performance, as well as convert it to Parquet.
Also, how can I add partitions automatically for new logs as they arrive?
Note:
csv file names follow a standard date format
files are written by an external source and cannot be changed to write into folders; they only land in the main bucket
I want to convert the csv files to Parquet separately
It appears that your situation is:
Objects are being uploaded to an Amazon S3 bucket
You would like those objects to be placed in a path hierarchy to support Amazon Athena partitioning
You could configure an Amazon S3 event to trigger an AWS Lambda function whenever a new object is created.
The Lambda function would:
Read the filename (or the contents of the file) to determine where it should be placed in the hierarchy
Perform a CopyObject() to put the object in the correct location (S3 does not have a 'move' command)
Delete the original object with DeleteObject()
Be careful that the above operation does not result in an event that triggers the Lambda function again (eg do it in a different folder or bucket), otherwise an infinite loop would occur.
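A minimal sketch of that Lambda, assuming the partition values can be parsed from the filename (the naming convention shown is made up) and that the destination is a separate, hypothetical bucket so the function does not re-trigger itself:

```python
import boto3
import urllib.parse

s3 = boto3.client("s3")
DEST_BUCKET = "my-partitioned-bucket"   # different bucket avoids re-triggering

def lambda_handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # e.g. "2023-08-14-access.csv" -> year/month/day partition folders
        year, month, day = key.split("-")[:3]
        dest_key = f"logs/year={year}/month={month}/day={day}/{key}"

        # S3 has no 'move': copy to the new location, then delete the original
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=dest_key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        s3.delete_object(Bucket=src_bucket, Key=key)
```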
When you wish to convert the CSV files to Parquet, see:
Converting to Columnar Formats - Amazon Athena
Using AWS Athena To Convert A CSV File To Parquet | CloudForecast Blog