I have a Lambda written in .NET Core. It invokes a COPY command on Redshift. My Lambda executes under a role which has access to Redshift and to S3.
My COPY command looks like this:
COPY my_table_name FROM 's3://my_bucket/my_file.csv' CREDENTIALS 'aws_access_key_id=x;aws_secret_access_key=y' DELIMITER ',' CSV;
This works fine. My problem is that the CREDENTIALS I am using for the COPY are completely independent of the role which the Lambda is running under.
Is there a way to execute the COPY command using the role which the lambda is executing under?
I definitely recommend using Role-Based authentication instead of Key-Based, but I believe you will need to assign the role directly to your Redshift cluster instead of trying to pass the role that is assigned to your Lambda function. The COPY command runs from within the Redshift cluster; your Lambda function is only telling Redshift to run the command, so it can't really use the Lambda function's role.
Please see the detailed documentation on Redshift Role-Based authentication.
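For example, once a suitable role with S3 read access is attached to the Redshift cluster, the key-based CREDENTIALS clause can be replaced with an IAM_ROLE clause. The ARN below is a placeholder for whatever role you attach:
COPY my_table_name FROM 's3://my_bucket/my_file.csv' IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role' DELIMITER ',' CSV;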
Related
I have 500GB of data in S3 which I have to move to Redshift, and to do it automatically we are planning to use Lambda. But I am not sure if Lambda would be able to do that, as it has a time limit of 15 minutes as well as a size limit (I guess 10 GB). Could you please help us understand if Lambda can be used for transferring a huge volume of data from S3 to Redshift?
Your AWS Lambda function can issue the COPY command via the execute_statement() command.
This command will continue operating without a connection, so the Lambda function can end after sending the command. The Lambda timeout is unimportant unless you specifically want to wait until it has finished to check the status.
The Amazon Redshift COPY command reads directly from an Amazon S3 bucket, so there is no need to load the data into the Lambda function.
I suggest that you first get the COPY command syntax correct by running it in the Redshift SQL console, and once it is working you could put the command in the Lambda function.
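If you are using Python, a minimal sketch of issuing the COPY through the Redshift Data API looks something like the following; the cluster identifier, database, user, table, bucket and role ARN are all placeholders you would replace:

import boto3

# Redshift Data API client; the COPY itself runs inside the cluster, not in Lambda
client = boto3.client('redshift-data')

response = client.execute_statement(
    ClusterIdentifier='my-cluster',     # placeholder
    Database='mydb',                    # placeholder
    DbUser='my_db_user',                # placeholder (or use SecretArn instead)
    Sql="COPY my_table FROM 's3://my_bucket/my_file.csv' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
        "DELIMITER ',' CSV;",
)

# Optional: check progress later rather than waiting inside the Lambda
print(client.describe_statement(Id=response['Id'])['Status'])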
I am using a step function and it gives a JSON to the Lambda as event(object data from s3 upload). I have to check the JSON and compare 2 values in it(file name and eTag) to the data in my redshift DB. If the entry does not exist, I have to classify the file to a different bucket and add an entry to the redshift DB(versioning). Trouble is, I do not have a good idea of how I can query and update Redshift through Lambda. Can someone please give suggestions on what methods I should adopt? Thanks!
Edit: Should've mentioned the lambda is in Python
One way to achieve this use case is to write the Lambda function using the Java run-time API and then, within the Lambda function, use a RedshiftDataClient object. Using this API, you can perform CRUD operations on a Redshift cluster.
To see examples:
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/example_code/redshift/src/main/java/com/example/redshiftdata
If you are unsure how to build a Lambda function by using the Lambda Java run-time API that can invoke AWS services, please refer to:
Creating an AWS Lambda function that detects images with Personal Protective Equipment
This example shows you how to develop a Lambda function using the Java runtime API that invokes AWS services. So instead of invoking Amazon S3 or Rekognition, use the RedshiftDataClient within the Lambda function to perform Redshift CRUD operations.
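If you prefer to stay in Python (as mentioned in the edit above), the same Redshift Data API is exposed through boto3. A rough sketch of the existence check, where the file_registry table and its columns are assumptions for illustration:

import time
import boto3

client = boto3.client('redshift-data')

def entry_exists(cluster, database, db_user, file_name, etag):
    # Query a hypothetical file_registry table for this file name / eTag pair
    resp = client.execute_statement(
        ClusterIdentifier=cluster,
        Database=database,
        DbUser=db_user,
        Sql='SELECT 1 FROM file_registry WHERE file_name = :fname AND etag = :tag',
        Parameters=[{'name': 'fname', 'value': file_name},
                    {'name': 'tag', 'value': etag}],
    )
    # The Data API is asynchronous, so poll until the statement completes
    while True:
        desc = client.describe_statement(Id=resp['Id'])
        if desc['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
            break
        time.sleep(0.5)
    if desc['Status'] != 'FINISHED':
        raise RuntimeError(desc.get('Error', 'query failed'))
    return len(client.get_statement_result(Id=resp['Id'])['Records']) > 0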
I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer the data to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the best way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline but it is not available in my region. I also looked through the AWS documentation for Lambda and Glue but don't know where to find the exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (eg psycopg2) to be able to call Redshift.
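A rough sketch of such a Lambda function using psycopg2; the endpoint, database, credentials and role ARN are placeholders (in practice they would come from environment variables or Secrets Manager):

import psycopg2

def lambda_handler(event, context):
    # Connect directly to the Redshift cluster endpoint (placeholders below)
    conn = psycopg2.connect(
        host='my-cluster.example.us-east-1.redshift.amazonaws.com',
        port=5439,
        dbname='mydb',
        user='my_user',
        password='my_password',
    )
    copy_sql = ("COPY my_table FROM 's3://my_bucket/my_file.csv' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
                "DELIMITER ',' CSV;")
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)   # Redshift pulls the file from S3 itself
    conn.close()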
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog
I have the following 2 use cases to apply here:
Case 1. I need to call the Lambda alone to invoke Athena to perform a query on S3 data. Question: how do I invoke the Lambda directly via an API?
Case 2. I need the Lambda function to invoke Athena whenever a file is copied to the S3 bucket that is already mapped to Athena.
I am referring to the following link to perform the Lambda operation over Athena:
https://dev.classmethod.jp/cloud/run-amazon-athenas-query-with-aws-lambda/
For case 2, here is an example of what I want to integrate:
The file in s3-1 is sales.csv, and I would be updating the sales details by copying data from another bucket, s3-2. The schema/columns defined in the s3-1 data would remain the same.
So when I copy a file to the same S3 location that is mapped to Athena, the Lambda should call Athena to perform the query.
I would appreciate it if you can suggest a better way to achieve the above cases.
Thanks
Case 1
An AWS Lambda can be directly invoked via the invoke() command. This can be done via the AWS Command-Line Interface (CLI) or from a programming language using an AWS SDK.
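For example, from Python with boto3 (the function name and payload are placeholders):

import json
import boto3

lambda_client = boto3.client('lambda')

response = lambda_client.invoke(
    FunctionName='my-athena-query-function',   # placeholder
    InvocationType='RequestResponse',          # use 'Event' for asynchronous invocation
    Payload=json.dumps({'query_date': '2020-01-01'}),
)
print(json.loads(response['Payload'].read()))

The equivalent from the CLI is: aws lambda invoke --function-name my-athena-query-function out.json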
Case 2
An Amazon S3 event can be configured on a bucket to automatically trigger an AWS Lambda function when a file is uploaded. The event provides the bucket name and file name (object name) to the Lambda function.
The Lambda function can extract these details from the event record and can then use that information in an Amazon Athena command.
Please note that, if the file name is different each time, a CREATE TABLE command would be required before a SELECT command can query the data.
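A minimal sketch of such a handler in Python; the Athena database, query and results location are assumptions for illustration:

import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    # Extract the bucket and object key from the S3 event record
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']
    print(f'New object: s3://{bucket}/{key}')

    # Run a query against the Athena table mapped to this location
    athena.start_query_execution(
        QueryString='SELECT COUNT(*) FROM sales',                           # placeholder query
        QueryExecutionContext={'Database': 'my_database'},                  # placeholder
        ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},  # placeholder
    )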
General Comments
A Lambda function can run for a maximum of 15 minutes, so make sure the Athena queries do not take more than this time. This is not a particularly efficient use of an AWS Lambda function because it will be billed for the duration of the function call, even if it is just waiting for Athena to finish.
Another option would be to have the Lambda function directly process the file, assuming that the query is not particularly complex. For example, the Lambda function could download the file to temporary storage (maximum 500MB), read through the file, do some calculations (eg add up the total of some columns), then store the results somewhere.
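As a rough sketch of that approach, assuming the CSV has an amount column to total:

import csv
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # Download to Lambda's temporary storage and total one column
    local_path = '/tmp/input.csv'
    s3.download_file(bucket, key, local_path)

    total = 0.0
    with open(local_path, newline='') as f:
        for row in csv.DictReader(f):
            total += float(row['amount'])   # hypothetical column name

    print(f'Total amount in s3://{bucket}/{key}: {total}')
    return {'total': total}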
The next step would be to create an endpoint for your Lambda; you can use aws-apigateway for that.
Alternatively, using the Amazon console or the AWS CLI, you can invoke the Lambda in order to test it.
How can I pass AWS credentials (aws_access_key and aws_secret_key) to PIG PigStorage function?
Thanks
Given this question is tagged with EMR, I am going to assume you are using AWS EMR for the Hadoop cluster. If this is the case, then no further setup is required to access S3. The EMR service automatically configures Hadoop FS (which PigStorage will leverage) with either the AWS credentials of the user starting the cluster or the instance role requested. Just provide the S3 location and Pig will interface with S3 according to the policy and permissions of the user/role.
A = LOAD 's3://<yourbucket>/<path>/' using PigStorage('\t') as (id:int, field2:chararray, field3:chararray);
I wasn't very explicit and didn't give an example of my use case, sorry. I needed that because I had to use two different AWS access keys, and using something like s3n://access:secret@bucket did not solve it. I solved this by changing the PigStorage function, storing the results in HDFS, and, in the cleanUpWithSuccess method, invoking a method that uploads the HDFS files to S3 with credentials. This way I can pass the credentials to the PigStorage function when it is used to store; of course, I also changed the constructor of PigStorage to receive these arguments.