Creating Connection for RedshiftDataOperator - amazon-web-services

So I went to the Airflow documentation for AWS Redshift, and there are two operators that can execute a SQL query: RedshiftSQLOperator and RedshiftDataOperator. I already implemented my job using RedshiftSQLOperator, but I want to do it with RedshiftDataOperator instead, because I don't want to use a Postgres connection (as RedshiftSQLOperator does); I want to use the AWS API.
RedshiftDataOperator Documentation
I have read this documentation, and there is an aws_conn_id parameter. But when I try to use the same connection ID, I get this error:
[2023-01-11, 04:55:56 UTC] {base.py:68} INFO - Using connection ID 'redshift_default' for task execution.
[2023-01-11, 04:55:56 UTC] {base_aws.py:206} INFO - Credentials retrieved from login
[2023-01-11, 04:55:56 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/redshift_data.py", line 146, in execute
self.statement_id = self.execute_query()
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/redshift_data.py", line 124, in execute_query
resp = self.hook.conn.execute_statement(**filter_values)
File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 415, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 745, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnrecognizedClientException) when calling the ExecuteStatement operation: The security token included in the request is invalid.
From the task definition:
redshift_data_task = RedshiftDataOperator(
    task_id='redshift_data_task',
    database='rds',
    region='ap-southeast-1',
    aws_conn_id='redshift_default',
    sql="""
    call some_procedure();
    """
)
What should I fill in for the Airflow connection? The documentation gives no example of the values I should enter in Airflow. Thanks.
Airflow RedshiftDataOperator Connection Required Value

Have you tried using the Amazon Redshift connection? It offers both an option for authenticating with your Redshift credentials:
Connection ID: redshift_default
Connection Type: Amazon Redshift
Host: <your-redshift-endpoint> (for example, redshift-cluster-1.123456789.us-west-1.redshift.amazonaws.com)
Schema: <your-redshift-database> (for example, dev, test, prod, etc.)
Login: <your-redshift-username> (for example, awsuser)
Password: <your-redshift-password>
Port: <your-redshift-port> (for example, 5439)
(source)
and an option for using an IAM role (there is an example in the first link).
Disclaimer: I work at Astronomer :)
EDIT: Tested the following with Airflow 2.5.0 and Amazon provider 6.2.0:
Added the IP of my Airflow instance to the VPC security group with "All traffic" access.
Created an Airflow connection with the connection ID aws_default, connection type "Amazon Web Services", and the extra field { "aws_access_key_id": "<your-access-key-id>", "aws_secret_access_key": "<your-secret-access-key>", "region_name": "<your-region-name>" }. All other fields are blank. I used a root key for my toy AWS account; if you use other credentials, you need to make sure the IAM role has access and the right permissions to the Redshift cluster (there is a list in the link above). A quick way to sanity-check these credentials outside Airflow is sketched after the operator code below.
Operator code:
red = RedshiftDataOperator(
    task_id="red",
    database="dev",
    sql="SELECT * FROM dev.public.users LIMIT 5;",
    cluster_identifier="redshift-cluster-1",
    db_user="awsuser",
    aws_conn_id="aws_default"
)
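Since the UnrecognizedClientException in the traceback above means AWS rejected the credentials themselves (an invalid security token), it can help to verify the key pair outside Airflow first. A minimal sketch, assuming boto3 is available and the placeholders are replaced with the same key pair, region, cluster identifier, and database used in the connection and operator above:

```python
import boto3

# Same placeholder values as the connection's extra field.
session = boto3.session.Session(
    aws_access_key_id="<your-access-key-id>",
    aws_secret_access_key="<your-secret-access-key>",
    region_name="<your-region-name>",
)

# If this call fails with an invalid-token error, the keys themselves are the
# problem, not the operator or the Airflow connection definition.
print(session.client("sts").get_caller_identity())

# Optional: confirm the same credentials can reach the Redshift Data API.
print(session.client("redshift-data").list_databases(
    ClusterIdentifier="redshift-cluster-1",
    Database="dev",
    DbUser="awsuser",
))
```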

Related

Getting error while testing AWS Lambda function: "Invalid database identifier"

Hi, I'm getting an error while testing my Lambda function:
{
  "errorMessage": "An error occurred (InvalidParameterValue) when calling the DescribeDBInstances operation: Invalid database identifier: <RDS instance id>",
  "errorType": "ClientError",
  "stackTrace": [
    " File \"/var/task/lambda_function.py\", line 25, in lambda_handler\n db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']\n",
    " File \"/var/runtime/botocore/client.py\", line 391, in _api_call\n return self._make_api_call(operation_name, kwargs)\n",
    " File \"/var/runtime/botocore/client.py\", line 719, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"
  ]
}
And here is my Lambda code:
import json
import boto3
import logging
import os

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):
    #log input event
    LOGGER.info("RdsAutoRestart Event Received, now checking if event is eligible. Event Details ==> ", event)

    #Input event from the SNS topic originated from RDS event notifications
    snsMessage = json.loads(event['Records'][0]['Sns']['Message'])
    rdsInstanceId = snsMessage['Source ID']
    stepFunctionInput = {"rdsInstanceId": rdsInstanceId}
    rdsEventId = snsMessage['Event ID']

    #Retrieve RDS instance ARN
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceArn = db_instance['DBInstanceArn']

    # Filter on the Auto Restart RDS Event. Event code: RDS-EVENT-0154.
    if 'RDS-EVENT-0154' in rdsEventId:
        #log input event
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        #Verify that instance is tagged with auto-restart-protection tag. The tag is used to classify instances that are required to be terminated once started.
        tagCheckPass = 'false'
        rdsInstanceTags = rdsClient.list_tags_for_resource(ResourceName=rdsInstanceArn)
        for rdsInstanceTag in rdsInstanceTags["TagList"]:
            if 'auto-restart-protection' in rdsInstanceTag["Key"]:
                if 'yes' in rdsInstanceTag["Value"]:
                    tagCheckPass = 'true'
                    #log instance tags
                    LOGGER.info("RdsAutoRestart verified that the instance is tagged auto-restart-protection = yes, now starting the Step Functions Flow")
                else:
                    tagCheckPass = 'false'
                    #log instance tags
                    LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        if 'true' in tagCheckPass:
            #Initialise StepFunctions Client
            stepFunctionsClient = boto3.client('stepfunctions')

            # Start StepFunctions WorkFlow
            # StepFunctionsArn is stored in an environment variable
            stepFunctionsArn = os.environ['STEPFUNCTION_ARN']
            stepFunctionsResponse = stepFunctionsClient.start_execution(
                stateMachineArn=stepFunctionsArn,
                name=event['Records'][0]['Sns']['MessageId'],
                input=json.dumps(stepFunctionInput)
            )
    else:
        LOGGER.info("RdsAutoRestart Event detected, and event is not eligible")

    return {
        'statusCode': 200
    }
I'm trying to stop an Amazon RDS database that starts automatically after 7 days. I'm following this AWS document: Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS | AWS Architecture Blog.
Can anyone help me?
The error message says: Invalid database identifier: <RDS instance id>
It seems to be coming from this line:
db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
The error message is saying that the rdsInstanceId variable contains <RDS instance id>, which seems to be an example value rather than a real value.
Looking at the code in Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS | AWS Architecture Blog, it asks you to create a test event that includes this message:
"Message": "{\"Event Source\":\"db-instance\",\"Event Time\":\"2020-07-09 15:15:03.031\",\"Identifier Link\":\"https://console.aws.amazon.com/rds/home?region=<region>#dbinstance:id=<RDS instance id>\",\"Source ID\":\"<RDS instance id>\",\"Event ID\":\"http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154\",\"Event Message\":\"DB instance started\"}",
If you look closely at that line, it includes this part to identify the Amazon RDS instance:
dbinstance:id=<RDS instance id>
I think that you are expected to modify the provided test event to fill in your own values for anything in <angled brackets> (such as the instance ID of your Amazon RDS instance).
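To illustrate, here is a hedged sketch of what a filled-in test event could look like, using a hypothetical instance identifier my-rds-instance (substitute the DB instance identifier shown in your RDS console) and only the fields the handler actually reads:

```python
import json

# Hypothetical example; "my-rds-instance" is a placeholder, not a real identifier.
test_event = {
    "Records": [
        {
            "Sns": {
                "MessageId": "test-message-id",
                "Message": json.dumps({
                    "Event Source": "db-instance",
                    "Source ID": "my-rds-instance",
                    "Event ID": "http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154",
                    "Event Message": "DB instance started",
                }),
            }
        }
    ]
}

# lambda_handler(test_event, None)  # or paste the equivalent JSON into the Lambda console test event
```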

GCP| Composer Dataproc submit job| Auth credential not found

I am running a GCP Composer cluster on GKE. I am defining a DAG to submit a job to a Dataproc cluster. I have read the GCP docs, and they say that Composer's service account will be used by the workers to send the Dataproc API requests.
But DataprocSubmitJobOperator reports an error getting the auth credentials.
Stack trace below. Composer environment info attached.
Any suggestions to fix this issue?
[2022-08-23, 16:03:25 UTC] {taskinstance.py:1448} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=harshit.bapna#dexterity.ai
AIRFLOW_CTX_DAG_ID=dataproc_spark_operators
AIRFLOW_CTX_TASK_ID=pyspark_task
AIRFLOW_CTX_EXECUTION_DATE=2022-08-23T16:03:16.986859+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-08-23T16:03:16.986859+00:00
[2022-08-23, 16:03:25 UTC] {dataproc.py:1847} INFO - Submitting job
[2022-08-23, 16:03:25 UTC] {credentials_provider.py:312} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-08-23, 16:03:25 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/operators/dataproc.py", line 1849, in execute
job_object = self.hook.submit_job(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
return func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/dataproc.py", line 869, in submit_job
client = self.get_job_client(region=region)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/dataproc.py", line 258, in get_job_client
credentials=self._get_credentials(), client_info=CLIENT_INFO, client_options=client_options
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 261, in _get_credentials
credentials, _ = self._get_credentials_and_project_id()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 240, in _get_credentials_and_project_id
credentials, project_id = get_credentials_and_project_id(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 321, in get_credentials_and_project_id
return _CredentialProvider(*args, **kwargs).get_credentials_and_project()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 229, in get_credentials_and_project
credentials, project_id = self._get_credentials_using_adc()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 307, in _get_credentials_using_adc
credentials, project_id = google.auth.default(scopes=self.scopes)
File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 459, in default
credentials, project_id = checker()
File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 221, in _get_explicit_environ_credentials
credentials, project_id = load_credentials_from_file(
File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 107, in load_credentials_from_file
raise exceptions.DefaultCredentialsError(
google.auth.exceptions.DefaultCredentialsError: File celery was not found.
[2022-08-23, 16:03:25 UTC] {taskinstance.py:1279} INFO - Marking task as UP_FOR_RETRY. dag_id=dataproc_spark_operators, task_id=pyspark_task, execution_date=20220823T160316, start_date=20220823T160324, end_date=20220823T160325
[2022-08-23, 16:03:25 UTC] {standard_task_runner.py:93} ERROR - Failed to execute job 32837 for task pyspark_task (File celery was not found.; 356144)
[2022-08-23, 16:03:26 UTC] {local_task_job.py:154} INFO - Task exited with return code 1
[2022-08-23, 16:03:26 UTC] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
GCP Composer Env
Based on the error File celery was not found, I think that Application Default Credentials (ADC) is trying to read a file named celery and not finding it. Check whether you have set the environment variable GOOGLE_APPLICATION_CREDENTIALS, because if it is set, ADC will read the file it points to:
If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC uses the service account key or configuration file that the variable points to.
If the environment variable GOOGLE_APPLICATION_CREDENTIALS isn't set, ADC uses the service account that is attached to the resource that is running your code.
This service account might be a default service account provided by Compute Engine, Google Kubernetes Engine, App Engine, Cloud Run, or Cloud Functions. It might also be a user-managed service account that you created.
GCP doc
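As a quick, hedged check you can run inside an Airflow task or a worker shell: print the variable, and if it holds something like celery instead of being unset or pointing to a valid service-account key file, ADC will fail exactly as in the traceback above.

```python
import os

# If this prints a value such as "celery" rather than None or a real key-file
# path, unset the variable (or point it at a valid key file) so ADC falls back
# to the service account attached to the Composer environment.
print(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))
```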

How to send an email in AWS MWAA (Apache Airflow) using EmailOperator

I am working with AWS MWAA (Apache Airflow). I want to send an email in MWAA upon completion of my pipeline. I have set the following configuration.
Now, when I run my DAG using an EmailOperator, it gives me an error.
File "/usr/lib64/python3.7/socket.py", line 707, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/usr/lib64/python3.7/socket.py", line 752, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
[2022-05-19, 11:11:21 UTC] {{local_task_job.py:154}} INFO - Task exited with return code 1
[2022-05-19, 11:11:21 UTC] {{local_task_job.py:264}} INFO - 0 downstream tasks scheduled from follow-on schedule check
Then I changed my configuration to
It now gives the following error
File "/usr/lib64/python3.7/smtplib.py", line 642, in auth
raise SMTPAuthenticationError(code, resp)
smtplib.SMTPAuthenticationError: (530, b'Must issue a STARTTLS command first')
[2022-05-19, 12:22:39 UTC] {{local_task_job.py:154}} INFO - Task exited with return code 1
[2022-05-19, 12:22:39 UTC] {{local_task_job.py:264}} INFO - 0 downstream tasks scheduled from follow-on schedule check
Can you please tell me where I am going wrong, or how I should configure this to send an email to a particular address on any domain?
Your SMTP host variable is an email address, not a host.
It should be smtp.gmail.com, not smtp#gmail.com.
You have hopefully also changed your password, as you shared it publicly in that screenshot and anyone could use it now.
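For reference, a minimal sketch of the operator side, assuming the SMTP settings (smtp_host=smtp.gmail.com, smtp_port=587, smtp_starttls=True, plus smtp_user and smtp_password) are applied as MWAA configuration overrides; the second error above, "Must issue a STARTTLS command first", typically means STARTTLS still needs to be enabled. The recipient address and subject below are placeholders.

```python
from airflow.operators.email import EmailOperator

# Hypothetical task; add it to your existing DAG and adjust the fields.
notify = EmailOperator(
    task_id="notify",
    to="recipient@example.com",
    subject="Pipeline finished",
    html_content="The DAG completed successfully.",
)
```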

az sql server ad-admin create fails in Azure DevOps with Python error

The command is pretty vanilla:
az sql server ad-admin create --display-name 'some group' --object-id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' --resource-group my-group --server my-server
The command works when I run it in a terminal, and other az commands run in the script, but when the script hits this line - no matter where I place it - I get the following error message.
Any ideas?
ERROR: create_or_update() missing 1 required positional argument: 'parameters'
2020-04-09T22:11:13.3286506Z Traceback (most recent call last):
2020-04-09T22:11:13.3287125Z File "/opt/az/lib/python3.6/site-packages/knack/cli.py", line 206, in invoke
2020-04-09T22:11:13.3287519Z cmd_result = self.invocation.execute(args)
2020-04-09T22:11:13.3288177Z File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 608, in execute
…
2020-04-09T22:11:13.3294117Z return …
2020-04-09T22:11:13.3294770Z File "/opt/az/lib/python3.6/site-packages/azure/cli/core/__init__.py", line 493, in default_command_handler
2020-04-09T22:11:13.3295184Z return op(**command_args)
2020-04-09T22:11:13.3295845Z File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/sql/custom.py", line 2074, in server_ad_admin_set
2020-04-09T22:11:13.3296258Z properties=kwargs)
2020-04-09T22:11:13.3296834Z TypeError: create_or_update() missing 1 required positional argument: 'parameters'
Try running these commands on separate lines. It might fix the error, or it can help you identify which option specifically has the error.
az sql server ad-admin create --display-name 'some group'
az sql server ad-admin create --object-id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
az sql server ad-admin create --resource-group my-group
az sql server ad-admin create --server my-server
You can also try removing the single quotes (') and adding = between the option name and its value.
Here is one approach:
az sql server ad-admin create --display-name=some group --object-id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx --resource-group=my-group --server=my-server

EndpointConnectionError: Could not connect to the endpoint URL: "http://169.254.169.254/....."

I am trying to create an AWS RDS instance and deploy a Lambda function using a Python script. However, I am getting the error below; it looks like the script is unable to communicate with AWS to create the RDS instance.
DEBUG: Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/meta-data/iam/security-credentials/: Could not connect to the endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/botocore/utils.py", line 303, in _get_request
response = self._session.send(request.prepare())
File "/usr/lib/python2.7/site-packages/botocore/httpsession.py", line 282, in send raise EndpointConnectionError(endpoint_url=request.url, error=e)
EndpointConnectionError: Could not connect to the endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
I am getting the AWS credentials through Okta SSO. In the ~/.aws directory, below are the contents of the 'credentials' and 'config' files, respectively.
[default]
aws_access_key_id = <Key Id>
aws_secret_access_key = <Secret Key>
aws_session_token = <Token>
[default]
region = us-west-2
```python
for az in availability_zones:
    if aurora.get_db_instance(db_instance_identifier + "-" + az)[0] != 0:
        aurora.create_db_instance(db_cluster_identifier, db_instance_identifier + "-" + az, az, subnet_group_identifier, db_instance_type)
    else:
        aurora.modify_db_instance(db_cluster_identifier, db_instance_identifier + "-" + az, az, db_instance_type)

# Wait for DB to become available for connection
iter_max = 15
iteration = 0
for az in availability_zones:
    while aurora.get_db_instance(db_instance_identifier + "-" + az)[1]["DBInstances"][0]["DBInstanceStatus"] != "available":
        iteration += 1
        if iteration < iter_max:
            logging.info("Waiting for DB instances to become available - iteration " + str(iteration) + " of " + str(iter_max))
            time.sleep(10*iteration)
        else:
            raise Exception("Waiting for DB Instance to become available timed out!")

cluster_endpoint = aurora.get_db_cluster(db_cluster_identifier)[1]["DBClusters"][0]["Endpoint"]
```
The actual error is below, coming from the while loop. DEBUG shows "Unable to locate credentials", but the credentials are there. I can deploy an Elastic Beanstalk environment from the CLI using the same AWS credentials, but not this. It looks like the aurora.create_db_instance call above failed.
DEBUG: Unable to locate credentials
Traceback (most recent call last):
File "./deploy_api.py", line 753, in <module> sync_rds()
File "./deploy_api.py", line 57, in sync_rds
while aurora.get_db_instance(db_instance_identifier + "-" + az)[1]["DBInstances"][0]["DBInstanceStatus"] != "available":
TypeError: 'NoneType' object has no attribute '__getitem__'
I had this error because an ECS task didn't have permissions to write to DynamoDB. The code causing the problem was:
from boto3 import resource
dynamodb_resource = resource("dynamodb")
The problem was resolved when I filled in the region_name, aws_access_key_id and aws_secret_access_key parameters for the resource() function call.
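A minimal sketch of what that looks like, with placeholder values; in practice, read the credentials from a secure source (or let an IAM role supply them) rather than hard-coding keys:

```python
from boto3 import resource

# Placeholders only; substitute the region and key pair you actually use.
dynamodb_resource = resource(
    "dynamodb",
    region_name="us-west-2",
    aws_access_key_id="<your-access-key-id>",
    aws_secret_access_key="<your-secret-access-key>",
)
```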
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.resource
If this doesn't solve your problem then check your code that connects to AWS services and make sure that you are filling in all of the proper function parameters.