How to pass args and template_fields to dataproc from airflow 1 - google-cloud-platform

I am trying to execute Python code on a Dataproc cluster via Airflow orchestration.
I am using Airflow 1.10.12 and DataprocWorkflowTemplateInstantiateInlineOperator to instantiate a Dataproc cluster and pass it some parameters (templated ones as well). The main objective is to run some prediction code.
Note that I changed this code to import DataprocInstantiateInlineWorkflowTemplateOperator from airflow.providers.google.cloud.operators.dataproc rather than airflow.contrib.operators.dataproc_operator, because the former introduced the parameters kwarg that, in theory, allows passing arguments to the cluster. Using the latter in my other scripts, I get no errors, but I cannot pass parameters to the cluster.
...
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocInstantiateInlineWorkflowTemplateOperator,
)
...
workflow_seg_members = make_workflow_template(
    region=REGION,
    dataproc_job_bucket=DATAPROC_JOB_BUCKET,
    python_main_executable_path="segmentation_members/seg_members_prediction.py",
)
op_seg_members_prediction = DataprocInstantiateInlineWorkflowTemplateOperator(
    task_id="seg_members_prediction",
    project_id=PROJECT_ID,
    region=REGION,
    template=workflow_seg_members,
    parameters={
        "execution_date_str": "{{ds_nodash}}",
        "project": "<REDACTED>",
        "dataset": "<REDACTED>",
        "features_table_prefix": "global_features",
        "output_table_prefix": "seg_members_output",
        "path_to_model": "segmentation_members/DecisionTreeClassifier.pkl",
        "bucket_name": "<REDACTED>",
        "model_designation": "Segmentation Members",
    },
)
In seg_members_prediction.py, I use argparse to declare the needed arguments.
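For reference, a minimal sketch of what that argparse setup looks like (the argument names mirror the parameters dict above; the exact flags, types and defaults in my script are simplified here):

import argparse

# Simplified sketch; the real seg_members_prediction.py may declare types,
# defaults and help strings for each flag.
parser = argparse.ArgumentParser(description="Segmentation members prediction")
parser.add_argument("--execution_date_str", required=True)
parser.add_argument("--project", required=True)
parser.add_argument("--dataset", required=True)
parser.add_argument("--features_table_prefix", required=True)
parser.add_argument("--output_table_prefix", required=True)
parser.add_argument("--path_to_model", required=True)
parser.add_argument("--bucket_name", required=True)
parser.add_argument("--model_designation", required=True)
args = parser.parse_args()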
The error I am getting is:
TypeError: Parameter to MergeFrom() must be instance of same class: expected google.cloud.dataproc.v1beta2.OrderedJob got str.
My questions are:
How do I fix this MergeFrom() exception?
Is this the right approach to pass parameters from Airflow to my Dataproc cluster?
Here is the complete stack trace:
File "/usr/local/lib/airflow/airflow/providers/google/common/hooks/base_google.py", line 373, in inner_wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/airflow/airflow/providers/google/cloud/hooks/dataproc.py", line 712, in instantiate_inline_workflow_template
metadata=metadata,
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/dataproc_v1beta2/gapic/workflow_template_service_client.py", line 488, in instantiate_inline_workflow_template
request_id=request_id,
TypeError: Parameter to MergeFrom() must be instance of same class: expected google.cloud.dataproc.v1beta2.OrderedJob got str.
[2022-06-13 12:36:34,696] {taskinstance.py:1197} INFO - Marking task as UP_FOR_RETRY. dag_id=ds_seg_members_integration, task_id=seg_members_prediction, execution_date=20220610T162234, start_date=20220613T123629, end_date=20220613T123634
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/usr/local/lib/airflow/airflow/bin/airflow", line 37, in <module>
args.func(args)
File "/usr/local/lib/airflow/airflow/utils/cli.py", line 76, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/bin/cli.py", line 561, in run
_run(args, dag, ti)
File "/usr/local/lib/airflow/airflow/bin/cli.py", line 480, in _run
pool=args.pool,
File "/usr/local/lib/airflow/airflow/utils/db.py", line 74, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 986, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/providers/google/cloud/operators/dataproc.py", line 1748, in execute
metadata=self.metadata,
File "/usr/local/lib/airflow/airflow/providers/google/common/hooks/base_google.py", line 373, in inner_wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/airflow/airflow/providers/google/cloud/hooks/dataproc.py", line 712, in instantiate_inline_workflow_template
metadata=metadata,
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/dataproc_v1beta2/gapic/workflow_template_service_client.py", line 488, in instantiate_inline_workflow_template
request_id=request_id,
TypeError: Parameter to MergeFrom() must be instance of same class: expected google.cloud.dataproc.v1beta2.OrderedJob got str.
EDIT:
I tried running the following code but still got the same error:
op_seg_members_prediction = DataprocInstantiateInlineWorkflowTemplateOperator(
    task_id="seg_members_prediction",
    project_id=PROJECT_ID,
    region=REGION,
    template=workflow_seg_members,
)
op_seg_members_prediction.execute(context="DEBUG")
After browsing the apache-airflow-providers-google docs I discovered it isn't compatible with Airflow 1:
You can install this package on top of an existing Airflow 2 installation (see Requirements below for the minimum Airflow version supported) via pip install apache-airflow-providers-google
The package supports the following Python versions: 3.7, 3.8, 3.9, 3.10
So I upgraded to Airflow 2 in my local environment, and the MergeFrom() error stopped occurring.
This still begs the question: how do you pass parameters from Airflow to Dataproc on Airflow 1?
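One possible workaround for Airflow 1 with the contrib DataprocWorkflowTemplateInstantiateInlineOperator (a hedged sketch, not a verified fix: the field names below follow the Dataproc WorkflowTemplates REST shape and may need adjusting to the client your hook uses, and make_workflow_template would have to build something like this) is to skip the parameters kwarg entirely and inline the values as job arguments:

workflow_seg_members = {
    "placement": {},  # managed cluster / cluster selector omitted here
    "jobs": [
        {
            "stepId": "seg_members_prediction",
            "pysparkJob": {
                "mainPythonFileUri": "gs://<DATAPROC_JOB_BUCKET>/segmentation_members/seg_members_prediction.py",
                "args": [
                    "--execution_date_str", "{{ ds_nodash }}",
                    "--path_to_model", "segmentation_members/DecisionTreeClassifier.pkl",
                    # ...remaining --key value pairs from the parameters dict above
                ],
            },
        }
    ],
}

If the contrib operator's template_fields includes template in your Airflow version (worth checking), the Jinja inside args is rendered at run time, so the values reach the job's argparse without going through parameters at all.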

Related

Can't connect superset to dremio

I am running apache-superset using docker-compose by following the instructions here (https://superset.apache.org/docs/installation/installing-superset-using-docker-compose/), with docker-compose-non-dev.yml.
I have also added sqlalchemy-dremio to superset/docker/requirements-local.txt in order to add Dremio support, as mentioned here (https://superset.apache.org/docs/databases/docker-add-drivers).
For Dremio, I have a separate container running the dremio/dremio-oss image, started with
docker run -p 9047:9047 -p 31010:31010 -p 45678:45678 -p 32010:32010 dremio/dremio-oss
and then made an account in Dremio using the web interface at localhost:9047.
But when I try to add Dremio as a database in Superset, pressing Test Connection gives the following error.
The connection string I'm using is
dremio+flight://dremio:dremio123#host.docker.internal:32010/dremio;SSL=0
At first I thought it might be a network error or an error in Dremio, but I can connect to Dremio using the Python script here: https://github.com/dremio-hub/arrow-flight-client-examples/blob/main/python/example.py
python example.py -host host.docker.internal -query 'SELECT 1'
This script runs successfully both from outside the container (from the host OS, using localhost) and from inside the superset_app container (using host.docker.internal as host). Therefore I don't think it's a network configuration problem, and this also confirms that the sqlalchemy-dremio package was installed properly inside the Superset containers.
Here are the docker logs for this error from the superset_app container:
2022-09-30 16:34:09,635:WARNING:superset.views.base:SupersetErrorsException
Traceback (most recent call last):
File "/app/superset/databases/commands/test_connection.py", line 123, in run
raise DBAPIError(None, None, None)
sqlalchemy.exc.DBAPIError: (builtins.NoneType) None
(Background on this error at: https://sqlalche.me/e/14/dbapi)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/usr/local/lib/python3.8/site-packages/flask_appbuilder/security/decorators.py", line 89, in wraps
return f(self, *args, **kwargs)
File "/app/superset/views/base_api.py", line 114, in wraps
raise ex
File "/app/superset/views/base_api.py", line 111, in wraps
duration, response = time_function(f, self, *args, **kwargs)
File "/app/superset/utils/core.py", line 1572, in time_function
response = func(*args, **kwargs)
File "/app/superset/utils/log.py", line 244, in wrapper
value = f(*args, **kwargs)
File "/app/superset/views/base_api.py", line 84, in wraps
return f(self, *args, **kwargs)
File "/app/superset/databases/api.py", line 708, in test_connection
TestConnectionDatabaseCommand(item).run()
File "/app/superset/databases/commands/test_connection.py", line 148, in run
raise DatabaseTestConnectionFailedError(errors) from ex
superset.databases.commands.exceptions.DatabaseTestConnectionFailedError: [SupersetError(message='(builtins.NoneType) None\n(Background on this error at: https://sqlalche.me/e/14/dbapi)', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'Dremio', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
***************
['UID=dremio', 'PWD=dremio123', 'HOST=host.docker.internal', 'PORT=32010', 'Schema=dremio', 'SSL=0']
***************
Ensure you are installing the latest version of sqlalchemy_dremio. You may need to install it from source, as setup.py wasn't updated accordingly (around the time of writing). You will also need to add some SQLAlchemy base functions to sqlalchemy_dremio. Have a look at the following issue: https://github.com/narendrans/sqlalchemy_dremio/issues/20
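Once the driver is installed, a quick way to check it and the connection string from inside the superset_app container, independently of the Superset UI (a hedged sketch; the host, credentials and SSL flag are taken from the question and may need adjusting):

from sqlalchemy import create_engine, text

# Illustrative check only; SQLAlchemy URLs use user:password@host:port.
engine = create_engine(
    "dremio+flight://dremio:dremio123@host.docker.internal:32010/dremio;SSL=0"
)
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).fetchall())

If this raises an error too, the problem is in the driver or connection string rather than in Superset itself.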

sam build --use-container failed mounting directory

There was no issue building the project a little while back, but it started throwing the error below:
RuntimeError: Container does not exist. Cannot get logs for this container
Normally this happens when Docker cannot mount the shared directory, but in this case even adding the lambda directory manually in the Docker interface didn't help!
Complete debug log of sam build --use-container:
Building function 'SAListManagerUrlLambda'
Fetching lambci/lambda:build-python3.7 Docker container image......
Mounting C:\Users\xxxx\xxxx\xxxx\xxxx\functions\xxxx-xxxx\xxxx-xxxx as /tmp/samcli/source:ro,delegated inside runtime container
Container was not created. Skipping deletion
Sending Telemetry: {'metrics': [{'commandRun': {'awsProfileProvided': False, 'debugFlagProvided': True, 'region': '', 'commandName': 'sam build', 'duration': 1292, 'exitReason': 'RuntimeError', 'exitCode': 255, 'requestId': 'cbfcd29c-16ae-xxxx-xxxx-b9ffec8de75a', 'installationId': 'fece8ccc-cb84-xxxx-xxxx-ac72820ef0c3', 'sessionId': 'e1cbc287-1850-xxxx-xxxx-3a235769f7fb', 'executionEnvironment': 'CLI', 'pyversion': '3.7.6', 'samcliVersion': '0.53.0'}}]}
HTTPSConnectionPool(host='aws-serverless-tools-telemetry.us-west-2.amazonaws.com', port=443): Read timed out. (read timeout=0.1)
Traceback (most recent call last):
File "D:\obj\windows-release\37amd64_Release\msi_python\zip_amd64\runpy.py", line 193, in _run_module_as_main
File "D:\obj\windows-release\37amd64_Release\msi_python\zip_amd64\runpy.py", line 85, in _run_code
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\__main__.py", line 12, in <module>
cli(prog_name="sam")
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\click\decorators.py", line 73, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\lib\telemetry\metrics.py", line 96, in wrapped
raise exception # pylint: disable=raising-bad-type
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\lib\telemetry\metrics.py", line 62, in wrapped
return_value = func(*args, **kwargs)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\commands\build\command.py", line 129, in cli
mode,
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\commands\build\command.py", line 194, in do_cli
artifacts = builder.build()
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\lib\build\app_builder.py", line 117, in build
function.metadata)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\lib\build\app_builder.py", line 271, in _build_function
options)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\lib\build\app_builder.py", line 369, in _build_function_on_container
container.wait_for_logs(stdout=stdout_stream, stderr=stderr_stream)
File "C:\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\local\docker\container.py", line 197, in wait_for_logs
raise RuntimeError("Container does not exist. Cannot get logs for this container")
RuntimeError: Container does not exist. Cannot get logs for this container
In my case the reason was different: Action Center's Focus Assist was set to Alarms Only.
This caused the shared-directory notification to fail, which caused the build failure.
So make sure your Focus Assist is set to OFF.
It seems that many situations can trigger the same error. For more information the --debug option can be used like this:
sam build --use-container --debug
I see that you are using it, because you got extra information like this:
Sending Telemetry: {'metrics': [{'commandRun': {'awsProfileProvided': False, 'debugFlagProvided': True, 'region': '', 'commandName': 'sam build', 'duration': 1292, 'exitReason': 'RuntimeError', 'exitCode': 255, 'requestId': 'cbfcd29c-16ae-xxxx-xxxx-b9ffec8de75a', 'installationId': 'fece8ccc-cb84-xxxx-xxxx-ac72820ef0c3', 'sessionId': 'e1cbc287-1850-xxxx-xxxx-3a235769f7fb', 'executionEnvironment': 'CLI', 'pyversion': '3.7.6', 'samcliVersion': '0.53.0'}}]}
HTTPSConnectionPool(host='aws-serverless-tools-telemetry.us-west-2.amazonaws.com', port=443): Read timed out. (read timeout=0.1)
Traceback (most recent call last):
In my case I suspected that the error came from sending the telemetry.
My guess is that the build process somehow needs to pass the region; in my case it is not us-west-2.
Anyway, I disabled telemetry as specified in the documentation and it now works.
In my case the local disk in my Cloud9 environment was almost full, so I had to delete some of the Docker images that come pre-installed with Cloud9.
To remove an image, use
docker rmi <image>
This will clear up space and your build will not fail the next time.

Does boto3 v1.9.244 support creating an 's3' resource?

I am attempting to retrieve a list of files with a specific prefix from S3 using an AWS Lambda. I bundle the Lambda with boto3-1.9.244 (the latest version). When I run the Lambda, I receive a SyntaxError on the S3 resource assignment, although it could have something to do with the boto3 session.
I'm using Python 3.6, and the AWS Lambda runtime ships boto3-1.9.221 and botocore-1.12.221. When I run the code without bundling the latest version of boto3, it works. My current workaround is to simply bundle boto3-1.9.221 with the Lambda code rather than the latest version of boto3.
import boto3
s3 = boto3.resource('s3')
I expect it to create an S3 resource, but I get this error:
invalid syntax (_base.py, line 414): SyntaxError
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 20, in lambda_handler
s3 = boto3.resource('s3')
File "/var/task/boto3/__init__.py", line 100, in resource
return _get_default_session().resource(*args, **kwargs)
File "/var/task/boto3/session.py", line 389, in resource
aws_session_token=aws_session_token, config=config)
File "/var/task/boto3/session.py", line 263, in client
aws_session_token=aws_session_token, config=config)
File "/var/task/botocore/session.py", line 839, in create_client
client_config=config, api_version=api_version)
File "/var/task/botocore/client.py", line 80, in create_client
cls = self._create_client_class(service_name, service_model)
File "/var/task/botocore/client.py", line 110, in _create_client_class
base_classes=bases)
File "/var/task/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/var/task/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/var/task/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/var/task/boto3/utils.py", line 61, in _handler
module = import_module(module)
File "/var/task/boto3/utils.py", line 52, in import_module
__import__(name)
File "/var/task/boto3/s3/inject.py", line 15, in <module>
from boto3.s3.transfer import create_transfer_manager
File "/var/task/boto3/s3/transfer.py", line 127, in <module>
from s3transfer.exceptions import RetriesExceededError as \
File "/var/task/s3transfer/__init__.py", line 134, in <module>
import concurrent.futures
File "/var/task/concurrent/futures/__init__.py", line 8, in <module>
from concurrent.futures._base import (FIRST_COMPLETED,
File "/var/task/concurrent/futures/_base.py", line 414
raise exception_type, self._exception, self._traceback
^
SyntaxError: invalid syntax
Yes, it is supported, so this issue is not related to the API version.
You can view the documentation for a specific API version just by replacing latest with the version number you want in the URL.
Latest:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucket
1.9.244:
https://boto3.amazonaws.com/v1/documentation/api/1.9.244/reference/services/s3.html#bucket
Turns out the issue was that I was installing the requirements with Python 2 rather than Python 3. By installing the requirements with Python 3, I no longer received a syntax error.
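This matches the traceback above: the line raise exception_type, self._exception, self._traceback in /var/task/concurrent/futures/_base.py is Python 2 raise syntax, so installing with pip under Python 2 pulled in the futures backport of concurrent.futures, which then shadows the Python 3 standard library inside the deployment package. A quick, illustrative way to confirm which copy of the package the Lambda resolves (sketch only):

import importlib.util

# find_spec locates the package without importing it, so this works even
# when the bundled copy would raise a SyntaxError on import.
spec = importlib.util.find_spec("concurrent")
print(spec.origin)  # a path under /var/task means a backport was bundled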
It looks like your Lambda function doesn't have an IAM role with S3 permissions. You may specify the access key and secret key on the resource directly,
resource = boto3.resource(
    's3',
    # Hard coded strings as credentials, not recommended.
    aws_access_key_id='AKIAIO5FODNN7E******',  # not real
    aws_secret_access_key='ABCDEF+c2L7yXeGvUyrPgYsDnWRRC1AYE******'  # not real
)
or give the right permissions to the Lambda function.

Cloud composer issue with datasets in Australia region

I was trying to use Cloud Composer to schedule and orchestrate BigQuery jobs. The BigQuery tables are in the australia-southeast1 region. The Cloud Composer environment was created in the us-central1 region (as Composer is not available in the Australia region). When I try the command below, it throws a vague error. The same setup worked fine when I tried it with datasets residing in the EU and US.
Command:
gcloud beta composer environments run bq-schedule --location us-central1 test -- my_bigquery_dag input_gl 8-02-2018
Error:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/usr/local/lib/airflow/airflow/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/airflow/airflow/bin/cli.py", line 528, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/airflow/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models.py", line 1583, in run
session=session)
File "/usr/local/lib/airflow/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models.py", line 1492, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/contrib/operators/bigquery_operator.py", line 98, in execute
self.create_disposition, self.query_params)
File "/usr/local/lib/airflow/airflow/contrib/hooks/bigquery_hook.py", line 499, in run_query
return self.run_with_configuration(configuration)
File "/usr/local/lib/airflow/airflow/contrib/hooks/bigquery_hook.py", line 868, in run_with_configuration
err.resp.status)
Exception: ('BigQuery job status check failed. Final error was: %s', 404)
Is there any workaround to resolve this issue?
Because your dataset resides in australia-southeast1, BigQuery created a job in the same location by default, which is australia-southeast1. However, the Airflow in your Composer environment was trying to get the job's status without specifying the location field.
Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
This has been fixed by my PR and it has been merged to master.
To work around this, you can extend the BigQueryCursor and override the run_with_configuration() function with location support.
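A minimal sketch of that workaround, assuming the contrib BigQueryCursor of this Airflow version keeps the discovery-based job service on self.service and polls jobs().get() as the traceback suggests (a real override would need to replicate the rest of the original run_with_configuration logic, including error handling):

import time

from airflow.contrib.hooks.bigquery_hook import BigQueryCursor


class BigQueryCursorWithLocation(BigQueryCursor):
    """Hypothetical subclass that passes a location when polling job status."""

    def run_with_configuration(self, configuration, location="australia-southeast1"):
        jobs = self.service.jobs()
        # Insert the job as the parent class does.
        reply = jobs.insert(
            projectId=self.project_id,
            body={"configuration": configuration},
        ).execute()
        job_id = reply["jobReference"]["jobId"]

        # Poll with an explicit location, per
        # https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
        while True:
            job = jobs.get(
                projectId=self.project_id, jobId=job_id, location=location
            ).execute()
            if job["status"]["state"] == "DONE":
                return job_id
            time.sleep(5)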

Apache Airflow 1.9 : Dataflow exception at the job's end

I am using Airflow 1.9 to launch a Dataflow job on Google Cloud Platform (GCP) with a DataFlowJavaOperator.
Below is the code used to launch the Dataflow job from an Airflow DAG:
df_dispatch_data = DataFlowJavaOperator(
    task_id='df-dispatch-data',  # Equivalent to JobName
    jar="/path/of/my/dataflow/jar",
    gcp_conn_id="my_connection_id",
    dataflow_default_options={
        'project': my_project_id,
        'zone': 'europe-west1-b',
        'region': 'europe-west1',
        'stagingLocation': 'gs://my-bucket/staging',
        'tempLocation': 'gs://my-bucket/temp'
    },
    options={
        'workerMachineType': 'n1-standard-1',
        'diskSizeGb': '50',
        'numWorkers': '1',
        'maxNumWorkers': '50',
        'schemaBucket': 'schemas_needed_to_dispatch',
        'autoscalingAlgorithm': 'THROUGHPUT_BASED',
        'readQuery': 'my_query'
    }
)
However, even though everything is fine on GCP because the job succeeds, an exception occurs at the end of the Dataflow job on my Airflow machine. It is thrown by gcp_dataflow_hook.py:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 528, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1584, in run
session=session)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1493, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/dataflow_operator.py", line 121, in execute
hook.start_java_dataflow(self.task_id, dataflow_options, self.jar)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 152, in start_java_dataflow
task_id, variables, dataflow, name, ["java", "-jar"])
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 146, in _start_dataflow
self.get_conn(), variables['project'], name).wait_for_done()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 31, in __init__
self._job = self._get_job()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 48, in _get_job
job = self._get_job_id_from_name()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 40, in _get_job_id_from_name
for job in jobs['jobs']:
KeyError: 'jobs'
Do you have any idea?
This issue is caused by the options used to launch the Dataflow job. If --zone or --region are given, the Google API call used to check the job status does not find the job; it only works with the default zone and region (us-central1). Since the job here runs in europe-west1, the jobs list comes back empty and the KeyError: 'jobs' is raised.
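To illustrate (a hedged sketch, not the hook's actual code: it queries the Dataflow REST API's regional jobs endpoint through google-api-python-client, with placeholder values and Application Default Credentials assumed), looking the job up through the regional endpoint is what a patched hook would need to do:

from googleapiclient.discovery import build

project_id = "my_project_id"  # placeholder

# Assumes Application Default Credentials are available.
service = build("dataflow", "v1b3")

# Per the answer above, the default job lookup only sees jobs in
# us-central1; a job launched with --region europe-west1 has to be
# listed through the regional endpoint instead.
regional_jobs = (
    service.projects()
    .locations()
    .jobs()
    .list(projectId=project_id, location="europe-west1")
    .execute()
)
print(regional_jobs.get("jobs", []))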