Is there any documentation regarding how to authenticate with Google Cloud Storage through an Airflow DAG?
I am trying to use Airflow's PostgresToGoogleCloudStorageOperator to upload the results of a query to Google Cloud Storage. However, I am getting the error below.
{gcp_api_base_hook.py:146} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
{taskinstance.py:1150} ERROR - 403 POST https://storage.googleapis.com/upload/storage/v1/b/my_test_bucket123/o?uploadType=multipart: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>)
I have configured "gcp_conn_id" in the Airflow UI and provided the JSON from my service account key, but I am still getting this error.
Below is the entire DAG file:
from datetime import datetime
from datetime import timedelta

from airflow import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import PostgresToGoogleCloudStorageOperator

DESTINATION_BUCKET = 'test'
DESTINATION_DIRECTORY = "uploads"

dag_params = {
    'dag_id': 'upload_test',
    'start_date': datetime(2020, 7, 7),
    'schedule_interval': '@once',
}

with DAG(**dag_params) as dag:
    upload = PostgresToGoogleCloudStorageOperator(
        task_id="upload",
        bucket=DESTINATION_BUCKET,
        filename=DESTINATION_DIRECTORY + "/{{ execution_date }}" + "/{}.json",
        sql='''SELECT * FROM schema.table;''',
        postgres_conn_id="postgres_default",
        gcp_conn_id="gcp_default"
    )
Sure. There is very nice documentation, with examples, for the Google Cloud connection:
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/connections/gcp.html
Also GCP has this very nice overview of all the authentication methods you can use:
https://cloud.google.com/endpoints/docs/openapi/authentication-method
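For a concrete illustration, here is a minimal sketch of pointing such a connection at a service account key file via an environment variable and then referencing it from the operator. The connection id gcp_default matches the DAG above, but the key path and project id are placeholders, the variable would normally be set in the worker's environment rather than in DAG code (Python is used here only to show the URI format), and the extras key names (the extra__google_cloud_platform__ prefix) should be checked against the provider version in the first link:

import os
from urllib.parse import quote_plus

# Placeholders: adjust to your environment.
KEY_PATH = "/opt/airflow/keys/service-account.json"
PROJECT_ID = "my-gcp-project"

# Airflow resolves connections defined as AIRFLOW_CONN_<CONN_ID> environment
# variables, so this defines a connection named "gcp_default" that the hook
# will use instead of falling back to google.auth.default().
os.environ["AIRFLOW_CONN_GCP_DEFAULT"] = (
    "google-cloud-platform://?"
    f"extra__google_cloud_platform__key_path={quote_plus(KEY_PATH)}"
    f"&extra__google_cloud_platform__project={quote_plus(PROJECT_ID)}"
)

With a key file resolved from the connection, the hook should no longer log "Getting connection using `google.auth.default()` since no key file is defined for hook", and a remaining 403 then usually comes down to the service account's permissions on the bucket.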
Related
I'm trying to connect to a spreadsheet and read its cells with Airflow, so I can eventually add this data to a database. For now I've created a Google Sheets document and have authenticated via the Google client in my browser; however, when I run the code it results in the following error: "User must be authenticated when user project is provided".
Details: "[{'@type': 'type.googleapis.com/google.rpc.ErrorInfo',
'reason': 'USER_PROJECT_DENIED',
'domain': 'googleapis.com',
'metadata': {'service': 'sheets.googleapis.com',
'consumer': 'projects/User-2023-01-19'}}]"
Here's the code I'm running after authenticating via "gcloud auth application-default login":
from airflow.providers.google.suite.hooks.sheets import GSheetsHook
from airflow.operators.python import PythonOperator

def get_data_from_spreadsheet():
    hook = GSheetsHook()
    spreadsheet = hook.get_values('-Sheets key-', 'B3')

get_data_from_spreadsheet()
I've tried running the code block above, expecting to get the data posted in the sheet, and instead get an error.
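For comparison, here is a minimal sketch of pointing the hook at an explicit Airflow connection instead of the application-default login; the connection id, spreadsheet id and range below are placeholders, and the connection's scopes would need to include the Sheets scope (https://www.googleapis.com/auth/spreadsheets):

from airflow.providers.google.suite.hooks.sheets import GSheetsHook

def get_data_from_spreadsheet():
    # "google_cloud_default" is the hook's default connection id; any Google
    # Cloud connection with suitable credentials and scopes can be used here.
    hook = GSheetsHook(gcp_conn_id="google_cloud_default")
    # Placeholder spreadsheet id and A1-notation range.
    return hook.get_values(spreadsheet_id="my-spreadsheet-id", range_="Sheet1!B3")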
I'm trying to pick a JSON file from a Cloud Storage bucket and dump it into BigQuery using Apache Airflow; however, I'm getting the following error:
Access Denied: BigQuery BigQuery: Missing required OAuth scope. Need BigQuery or Cloud Platform write scope.
This is my code:
from datetime import timedelta, datetime
import json

from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    #'start_date': seven_days_ago,
    'start_date': datetime(2022, 11, 1),
    'email': ['uzair.zafar@gmail.pk'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
}

with DAG('checking_airflow',
         default_args=default_args,
         description='dag to start the logging of data in logging table',
         schedule_interval='@daily',
         start_date=datetime(2022, 11, 1),
         ) as dag:
    dump_csv_to_temp_table = GCSToBigQueryOperator(
        task_id='gcs_to_bq_load',
        google_cloud_storage_conn_id='gcp_connection',
        bucket='airflow-dags',
        #filename='users/users.csv',
        source_objects='users/users0.json',
        #schema_object='schemas/users.json',
        source_format='NEWLINE_DELIMITED_JSON',
        create_disposition='CREATE_IF_NEEDED',
        destination_project_dataset_table='project.supply_chain.temporary_users',
        write_disposition='WRITE_TRUNCATE',
        dag=dag,
    )

    dump_csv_to_temp_table
Please assist me in solving this issue.
Could you please share more details? Are you using Cloud Composer to perform this task?
If the environment is running, there should be a default Google connection that you can use. Go to the Airflow UI >> Admin >> Connections and you should see google_cloud_default there.
In Composer, if you don't specify a connection, the default one is used for interacting with Google Cloud resources, as sketched below.
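For illustration, the task from the question using that default connection would look roughly like this (the bucket, object and table names are copied from the question; leaving out the connection id makes the operator fall back to google_cloud_default):

from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Inside the `with DAG(...)` block from the question.
dump_csv_to_temp_table = GCSToBigQueryOperator(
    task_id='gcs_to_bq_load',
    # No connection id given: the operator uses google_cloud_default, which in
    # Composer is backed by the environment's service account.
    bucket='airflow-dags',
    source_objects=['users/users0.json'],
    source_format='NEWLINE_DELIMITED_JSON',
    create_disposition='CREATE_IF_NEEDED',
    destination_project_dataset_table='project.supply_chain.temporary_users',
    write_disposition='WRITE_TRUNCATE',
)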
I have a gRPC service deployed on Google Cloud Run which I want to call from Composer.
I have assigned the roles/iam.serviceAccountTokenCreator role to the service account which my composer worker nodes are running under, and I'm not mounting any custom service key files or setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.
Using the JWT_GOOGLE authentication option in the Airflow gRPC hook, I get the following error:
[2022-05-31 14:20:16,082] {grpc.py:90} INFO - Calling gRPC service
[2022-05-31 14:20:16,097] {taskinstance.py:1152} ERROR - 'Credentials' object has no attribute 'signer_email'
Traceback (most recent call last):
  File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 985, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/airflow/airflow/providers/grpc/operators/grpc.py", line 95, in execute
    for response in responses:
  File "/usr/local/lib/airflow/airflow/providers/grpc/hooks/grpc.py", line 136, in run
    with self.get_conn() as channel:
  File "/usr/local/lib/airflow/airflow/providers/grpc/hooks/grpc.py", line 104, in get_conn
    jwt_creds = google_auth_jwt.OnDemandCredentials.from_signing_credentials(credentials)
  File "/opt/python3.6/lib/python3.6/site-packages/google/auth/jwt.py", line 695, in from_signing_credentials
    kwargs.setdefault("issuer", credentials.signer_email)
AttributeError: 'Credentials' object has no attribute 'signer_email'
[2022-05-31 14:20:16,100] {taskinstance.py:1196} INFO - Marking task as FAILED. dag_id=example_dag, task_id=example_task, execution_date=20220531T135709, start_date=20220531T142015, end_date=20220531T142016
[2022-05-31 14:20:23,826] {local_task_job.py:102} INFO - Task exited with return code 1
Does anyone have any idea how/why my credentials aren't including the field I need?
Found a solution to this after discussing with Google Cloud: essentially, it looks like the JWT_GOOGLE authentication method isn't set up for GCE service accounts, so I went down the CUSTOM authentication route instead:
import google.auth.transport.grpc
import google.auth.transport.requests
import google.oauth2.credentials
import google.oauth2.id_token

from airflow.providers.grpc.operators.grpc import GrpcOperator


def connection_func(conn):
    """Custom connection function for gRPC authentication.

    Args:
        conn: Airflow Connection object

    Returns:
        An instantiated gRPC channel for making calls to our remote service.
    """
    request = google.auth.transport.requests.Request()

    if not str(conn.host).startswith("https://"):
        audience = f"https://{conn.host}"
    else:
        audience = conn.host

    token = google.oauth2.id_token.fetch_id_token(request, audience)
    creds = google.oauth2.credentials.Credentials(token)

    base_url = conn.host
    if conn.port:
        base_url = f"{base_url}:{conn.port}"

    channel = google.auth.transport.grpc.secure_authorized_channel(
        creds, None, base_url
    )
    return channel


return GrpcOperator(
    ...
    custom_connection_func=connection_func,
)
This uses the approach seen here to fetch an ID token for a given audience, then create a set of credentials from there and finally instantiate the gRPC secure channel for use in the operator.
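For completeness, here is a sketch of how that connection function might be wired into a full operator call; the stub class, RPC method name and connection id below are hypothetical placeholders rather than part of the solution above:

from airflow.providers.grpc.operators.grpc import GrpcOperator

# Hypothetical stub generated from your .proto file.
from my_service_pb2_grpc import MyServiceStub  # placeholder import

call_service = GrpcOperator(
    task_id="call_cloud_run_grpc",            # placeholder task id
    grpc_conn_id="grpc_default",              # connection holding the Cloud Run host
    stub_class=MyServiceStub,                 # placeholder generated stub class
    call_func="SayHello",                     # placeholder RPC method name
    custom_connection_func=connection_func,   # the function defined above
    log_response=True,
)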
I'm trying to run a container with Cloud Run as a task of an Airflow DAG.
It seems there is nothing like a CloudRunOperator or similar, and I can't find anything in the documentation (either Cloud Run's or Airflow's).
Has anyone ever dealt with this problem?
If yes, how can I run a container with Cloud Run and handle xcom?
Thanks in advance!!
AFAIK, when a container is deployed to Cloud Run it automatically listens for any requests sent to it. See the documentation for reference.
So instead of a dedicated operator, you can simply send a request to the deployed container. You can do this using the code below.
This DAG has three tasks: print_token, task_get_op and process_data.
print_token prints the identity token needed to authenticate the requests to your deployed Cloud Run container. I used xcom_pull to get the output of the BashOperator and assigned the authentication token to token, so it can be used to authenticate the HTTP request you will perform.
task_get_op performs a GET on the connection cloud_run (which just contains the Cloud Run endpoint), with the header 'Authorization': 'Bearer ' + token for authentication.
process_data performs xcom_pull on task_get_op to get its output and prints it using a PythonOperator.
import datetime

import airflow
from airflow.operators import bash
from airflow.operators import python
from airflow.providers.http.operators.http import SimpleHttpOperator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with airflow.DAG(
        'composer_http_request',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    print_token = bash.BashOperator(
        task_id='print_token',
        bash_command='gcloud auth print-identity-token "--audiences=https://hello-world-fri824-ab.c.run.app"'  # the endpoint of the deployed Cloud Run container
    )

    token = "{{ task_instance.xcom_pull(task_ids='print_token') }}"  # gets the output of the 'print_token' task

    task_get_op = SimpleHttpOperator(
        task_id='get_op',
        method='GET',
        http_conn_id='cloud_run',
        headers={'Authorization': 'Bearer ' + token},
    )

    def process_data_from_http(**kwargs):
        ti = kwargs['ti']
        http_data = ti.xcom_pull(task_ids='get_op')
        print(http_data)

    process_data = python.PythonOperator(
        task_id='process_data_from_http',
        python_callable=process_data_from_http,
        provide_context=True
    )

    print_token >> task_get_op >> process_data
(The original post included screenshots here: the cloud_run connection, the DAG graph, and the logs of the print_token, task_get_op and process_data tasks, the last showing the output of the GET request.)
NOTE: I'm using Cloud Composer 1.17.7 and Airflow 2.0.2 and installed apache-airflow-providers-http to be able to use the SimpleHttpOperator.
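As a variation (not part of the answer above, just a sketch), the identity token can also be fetched directly in Python with google.oauth2.id_token, as in the gRPC answer earlier, instead of shelling out to gcloud; the audience below is the same placeholder Cloud Run endpoint:

import google.auth.transport.requests
import google.oauth2.id_token
from airflow.operators import python

def fetch_identity_token():
    # The audience must be the URL of the deployed Cloud Run service (placeholder here).
    audience = "https://hello-world-fri824-ab.c.run.app"
    request = google.auth.transport.requests.Request()
    # Uses the worker's service account (metadata server or GOOGLE_APPLICATION_CREDENTIALS).
    return google.oauth2.id_token.fetch_id_token(request, audience)

print_token = python.PythonOperator(
    task_id="print_token",
    python_callable=fetch_identity_token,  # the return value is pushed to XCom automatically
)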
I'm trying to use Dialogflow's detect_intent in Python and I keep getting:
404 com.google.apps.framework.request.NotFoundException: Agent metadata not found for agentId: ####-####-####-####-####
Here's a snippet of my code:
import os

import google.cloud.dialogflow as dialogflow
from CONFIG import DIALOGFLOW_PROJECT_ID

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'credentials/dialogflow.json'

def predict_intent(text, language):
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(DIALOGFLOW_PROJECT_ID, SESSION_ID)
    text_input = dialogflow.TextInput(text=text, language_code=language)
    query_input = dialogflow.QueryInput(text=text_input)
    response = session_client.detect_intent(session=session, query_input=query_input)  # ERROR
    return response.query_result.intent.display_name
I tried running the function multiple times; some calls succeed, but most fail with the exception.
I can train the bot using the same interface and it works fine.
I'm using Python 3.7 and the following Google Cloud modules: google-api-core==2.0.1, google-auth==2.0.2, google-cloud-dialogflow==2.7.1, googleapis-common-protos==1.53.0.
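Since repeated calls sometimes succeed, one small diagnostic sketch (an assumption about how the error surfaces, not a fix) is to wrap the existing predict_intent in a retry loop that logs which attempts hit the 404, to see whether the failure is intermittent or tied to particular inputs:

import time
from google.api_core.exceptions import NotFound

def predict_intent_with_retry(text, language, attempts=3, delay=1.0):
    """Call predict_intent (defined above), retrying when the 404 NotFound is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return predict_intent(text, language)
        except NotFound as exc:
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            if attempt == attempts:
                raise
            time.sleep(delay)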