Cloud Composer issue with datasets in Australia region - google-cloud-platform

I was trying to use Cloud Composer to schedule and orchestrate BigQuery jobs. The BigQuery tables are in the australia-southeast1 region. The Cloud Composer environment was created in the us-central1 region (Composer is not available in the Australia region). When I run the command below, it throws a vague error. The same setup worked fine when I tried it with datasets residing in the EU and US.
Command:
gcloud beta composer environments run bq-schedule --location us-central1 test -- my_bigquery_dag input_gl 8-02-2018
Error:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/usr/local/lib/airflow/airflow/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/airflow/airflow/bin/cli.py", line 528, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/airflow/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models.py", line 1583, in run
session=session)
File "/usr/local/lib/airflow/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models.py", line 1492, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/contrib/operators/bigquery_operator.py", line 98, in execute
self.create_disposition, self.query_params)
File "/usr/local/lib/airflow/airflow/contrib/hooks/bigquery_hook.py", line 499, in run_query
return self.run_with_configuration(configuration)
File "/usr/local/lib/airflow/airflow/contrib/hooks/bigquery_hook.py", line 868, in run_with_configuration
err.resp.status)
Exception: ('BigQuery job status check failed. Final error was: %s', 404)
Is there any workaround to resolve this issue?

Because your dataset resides in australia-southeast1, BigQuery created the job in that location by default. However, the Airflow code in your Composer environment was checking the job's status without specifying the location field, so the lookup could not find the job and failed with a 404.
Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
This has been fixed by my PR, which has been merged to master.
To work around this for now, you can extend BigQueryCursor and override run_with_configuration() to add location support.
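Here is a minimal sketch of that workaround, assuming the Airflow 1.x contrib layout in which the cursor exposes self.service (the BigQuery API client) and self.project_id; the class name, default location, and polling loop are illustrative:

import time
from airflow.contrib.hooks.bigquery_hook import BigQueryCursor

class BigQueryCursorWithLocation(BigQueryCursor):
    # Cursor that creates and polls BigQuery jobs in an explicit location.
    def run_with_configuration(self, configuration, location='australia-southeast1'):
        jobs = self.service.jobs()
        job_data = {
            # Create the job explicitly in the dataset's location.
            'jobReference': {'location': location},
            'configuration': configuration,
        }
        job = jobs.insert(projectId=self.project_id, body=job_data).execute()
        job_id = job['jobReference']['jobId']
        while True:
            # Pass location to jobs.get(), which the stock hook omitted.
            status = jobs.get(projectId=self.project_id, jobId=job_id,
                              location=location).execute()['status']
            if status['state'] == 'DONE':
                if 'errorResult' in status:
                    raise Exception('BigQuery job failed: %s' % status['errorResult'])
                return job_id
            time.sleep(5)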

Related

Airflow EmrCreateJobFlowOperator `label is invalid: emr-6.8.0` Error On Latest EMR Version

EMR released a new cluster version today, but when I attempt to upgrade to the latest EMR version using the contributed EMR create-job-flow operator, I hit the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/airflow/dags/plugins/operators/shippo_emr_operators.py", line 133, in execute
return super().execute(context)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/emr_create_job_flow.py", line 81, in execute
response = emr.create_job_flow(job_flow_overrides)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/emr.py", line 88, in create_job_flow
response = self.get_conn().run_job_flow(**config)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the RunJobFlow operation: The supplied release label is invalid: emr-6.8.0.
Looking at the EMR contrib code, I don't see any hard-coded values, so I'm not sure why we're hitting this error. Has the label format changed, and if so, where can I find the exact string?
EDIT: The plot thickens. If I run aws emr list-release-labels I get
NextToken: AAIAAdZ_6MGjAhReZYcOrXICLpYU98iQO_ZB3kCK65qEWRH9MrJLdi_r-alVGb1AZlnFg0vsdxRUzdBLt-SyQ3TznUBM8Ncu7n94pJVQykbWe_TapxBi2WpUkcZfRAcxYgcg6TwejeaxGKcbysA89Jc9M3vIlVQetGgY1zQESS2Dq3P9vxvsOo3xxZoTqnmOVjs24Hy1hPM8zfzoUfH7MMomXkqhU5MHZ0cG3Aee5F51LtNS0_NBge399SiDYwhz1W2RB2tAjDc=
ReleaseLabels:
- emr-6.7.0
- emr-6.6.0
- emr-6.5.0
- emr-6.4.0
This suggests that the release label has been updated in the docs but not actually rolled out to the tooling?
EMR releases new versions in a few regions first; you are probably trying to launch the cluster in a region where the new release label is not available yet.
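To check where the new label is already exposed, a small boto3 sketch using the same ListReleaseLabels API as the CLI call above (the region list is just an example and pagination is ignored for brevity):

import boto3

def regions_with_release(label='emr-6.8.0',
                         regions=('us-east-1', 'us-west-2', 'eu-west-1')):
    # Return the regions whose EMR ListReleaseLabels output contains the label.
    available = []
    for region in regions:
        emr = boto3.client('emr', region_name=region)
        labels = emr.list_release_labels().get('ReleaseLabels', [])
        if label in labels:
            available.append(region)
    return available

print(regions_with_release())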

How to pass args and template_fields to dataproc from airflow 1

I am trying to execute Python code on a Dataproc cluster via Airflow orchestration.
I am using Airflow 1.10.12 and DataprocWorkflowTemplateInstantiateInlineOperator to instantiate a Dataproc cluster and pass some parameters (including templated params). The main objective is to run some prediction code.
Note that I upgraded this code to use airflow.providers.google.cloud.operators.dataproc rather than airflow.contrib.operators.dataproc_operator to import DataprocInstantiateInlineWorkflowTemplateOperator, because the former introduces the parameters kwarg that, in theory, permits passing arguments to the cluster. Using the latter in my other scripts, I get no errors, but I cannot pass parameters to the cluster.
...
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocInstantiateInlineWorkflowTemplateOperator,
)
...
workflow_seg_members = make_workflow_template(
    region=REGION,
    dataproc_job_bucket=DATAPROC_JOB_BUCKET,
    python_main_executable_path="segmentation_members/seg_members_prediction.py",
)
op_seg_members_prediction = DataprocInstantiateInlineWorkflowTemplateOperator(
    task_id="seg_members_prediction",
    project_id=PROJECT_ID,
    region=REGION,
    template=workflow_seg_members,
    parameters={
        "execution_date_str": "{{ds_nodash}}",
        "project": "<REDACTED>",
        "dataset": "<REDACTED>",
        "features_table_prefix": "global_features",
        "output_table_prefix": "seg_members_output",
        "path_to_model": "segmentation_members/DecisionTreeClassifier.pkl",
        "bucket_name": "<REDACTED>",
        "model_designation": "Segmentation Members",
    },
)
In seg_members_prediction.py, I use argparse to create the needed arguments.
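For reference, the argparse section looks roughly like this (a simplified sketch; the flag names simply mirror the keys passed in parameters above):

import argparse

def parse_args():
    # One flag per parameter passed in from the Airflow operator.
    parser = argparse.ArgumentParser(description='Segmentation members prediction')
    parser.add_argument('--execution_date_str', required=True)
    parser.add_argument('--project', required=True)
    parser.add_argument('--dataset', required=True)
    parser.add_argument('--features_table_prefix', default='global_features')
    parser.add_argument('--output_table_prefix', default='seg_members_output')
    parser.add_argument('--path_to_model', required=True)
    parser.add_argument('--bucket_name', required=True)
    parser.add_argument('--model_designation', required=True)
    return parser.parse_args()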
The error I am getting is:
TypeError: Parameter to MergeFrom() must be instance of same class: expected google.cloud.dataproc.v1beta2.OrderedJob got str.
My questions are:
How do I fix this MergeFrom() exception?
Is this the right approach to pass parameters from airflow to my dataproc cluster?
Here is the complete stack trace:
File "/usr/local/lib/airflow/airflow/providers/google/common/hooks/base_google.py", line 373, in inner_wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/airflow/airflow/providers/google/cloud/hooks/dataproc.py", line 712, in instantiate_inline_workflow_template
metadata=metadata,
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/dataproc_v1beta2/gapic/workflow_template_service_client.py", line 488, in instantiate_inline_workflow_template
request_id=request_id,
TypeError: Parameter to MergeFrom() must be instance of same class: expected google.cloud.dataproc.v1beta2.OrderedJob got str.
[2022-06-13 12:36:34,696] {taskinstance.py:1197} INFO - Marking task as UP_FOR_RETRY. dag_id=ds_seg_members_integration, task_id=seg_members_prediction, execution_date=20220610T162234, start_date=20220613T123629, end_date=20220613T123634
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/usr/local/lib/airflow/airflow/bin/airflow", line 37, in <module>
args.func(args)
File "/usr/local/lib/airflow/airflow/utils/cli.py", line 76, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/bin/cli.py", line 561, in run
_run(args, dag, ti)
File "/usr/local/lib/airflow/airflow/bin/cli.py", line 480, in _run
pool=args.pool,
File "/usr/local/lib/airflow/airflow/utils/db.py", line 74, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 986, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/providers/google/cloud/operators/dataproc.py", line 1748, in execute
metadata=self.metadata,
File "/usr/local/lib/airflow/airflow/providers/google/common/hooks/base_google.py", line 373, in inner_wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/airflow/airflow/providers/google/cloud/hooks/dataproc.py", line 712, in instantiate_inline_workflow_template
metadata=metadata,
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/dataproc_v1beta2/gapic/workflow_template_service_client.py", line 488, in instantiate_inline_workflow_template
request_id=request_id,
TypeError: Parameter to MergeFrom() must be instance of same class: expected google.cloud.dataproc.v1beta2.OrderedJob got str.
EDIT:
I tried running the following code but still got the same error:
op_seg_members_prediction = DataprocInstantiateInlineWorkflowTemplateOperator(
    task_id="seg_members_prediction",
    project_id=PROJECT_ID,
    region=REGION,
    template=workflow_seg_members,
)
op_seg_members_prediction.execute(context="DEBUG")
After browsing the apache-airflow-providers-google docs, I discovered that it isn't compatible with Airflow 1:
You can install this package on top of an existing Airflow 2
installation (see Requirements below for the minimum Airflow version
supported) via pip install apache-airflow-providers-google
The package supports the following python versions: 3.7,3.8,3.9,3.10
So I upgraded to Airflow 2 in my local environment, and the MergeFrom() error stopped occurring.
This still leaves the question: how do you pass parameters from Airflow to Dataproc on Airflow 1?

Apache Airflow 1.9: Dataflow exception at the job's end

I am using Airflow 1.9 to launch a Dataflow job on Google Cloud Platform (GCP) via a DataflowJavaOperator.
Below is the code used to launch the Dataflow job from an Airflow DAG:
df_dispatch_data = DataFlowJavaOperator(
    task_id='df-dispatch-data',  # Equivalent to JobName
    jar="/path/of/my/dataflow/jar",
    gcp_conn_id="my_connection_id",
    dataflow_default_options={
        'project': my_project_id,
        'zone': 'europe-west1-b',
        'region': 'europe-west1',
        'stagingLocation': 'gs://my-bucket/staging',
        'tempLocation': 'gs://my-bucket/temp'
    },
    options={
        'workerMachineType': 'n1-standard-1',
        'diskSizeGb': '50',
        'numWorkers': '1',
        'maxNumWorkers': '50',
        'schemaBucket': 'schemas_needed_to_dispatch',
        'autoscalingAlgorithm': 'THROUGHPUT_BASED',
        'readQuery': 'my_query'
    }
)
However, even though everything looks right on GCP and the job succeeds, an exception occurs at the end of the Dataflow job on my Airflow machine. It is thrown by gcp_dataflow_hook.py:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 528, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1584, in run
session=session)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1493, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/dataflow_operator.py", line 121, in execute
hook.start_java_dataflow(self.task_id, dataflow_options, self.jar)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 152, in start_java_dataflow
task_id, variables, dataflow, name, ["java", "-jar"])
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 146, in _start_dataflow
self.get_conn(), variables['project'], name).wait_for_done()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 31, in __init__
self._job = self._get_job()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 48, in _get_job
job = self._get_job_id_from_name()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 40, in _get_job_id_from_name
for job in jobs['jobs']:
KeyError: 'jobs'
Any idea what is causing this?
This issue is caused by the options used to launch the Dataflow job. If --zone or --region are given, the Google API call that retrieves the job status does not work; it only works with the default zone and region (US/us-central1).
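In other words, a workaround sketch for Airflow 1.9 is to leave the zone/region options out and let the job run (and be polled) in the default region:

df_dispatch_data = DataFlowJavaOperator(
    task_id='df-dispatch-data',
    jar="/path/of/my/dataflow/jar",
    gcp_conn_id="my_connection_id",
    dataflow_default_options={
        # 'zone' and 'region' intentionally omitted: the Airflow 1.9 hook only
        # looks up job status in the default region (us-central1).
        'project': my_project_id,
        'stagingLocation': 'gs://my-bucket/staging',
        'tempLocation': 'gs://my-bucket/temp'
    },
    options={
        'workerMachineType': 'n1-standard-1',
        'diskSizeGb': '50',
        'numWorkers': '1',
        'maxNumWorkers': '50',
        'schemaBucket': 'schemas_needed_to_dispatch',
        'autoscalingAlgorithm': 'THROUGHPUT_BASED',
        'readQuery': 'my_query'
    }
)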

'No such file or directory' error after submitting a training job

I execute:
gcloud beta ml jobs submit training ${JOB_NAME} --config config.yaml
and after about 5 minutes the job errors out with this error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 232, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 228, in main
run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 129, in run_training
data_sets = input_data.read_data_sets(FLAGS.train_dir, FLAGS.fake_data)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 212, in read_data_sets
with open(local_file, 'rb') as f:
IOError: [Errno 2] No such file or directory: 'gs://my-bucket/mnist/train/train-images.gz'
The strange thing is that, as far as I can tell, the file exists at that URL.
This error usually indicates you are using a multi-regional GCS bucket for your output. To avoid it, use a regional GCS bucket; regional buckets provide the stronger consistency guarantees needed to avoid these types of errors.
For more information about properly setting up GCS buckets for Cloud ML, please refer to the Cloud ML docs.
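For illustration only (not from the original answer), creating a regional bucket with a recent google-cloud-storage Python client might look like this; the bucket name and region are placeholders, and gsutil mb -l <region> achieves the same thing:

from google.cloud import storage

client = storage.Client()
# Create a regional bucket so training jobs read/write against a single region.
bucket = client.create_bucket('my-ml-bucket', location='us-central1')
print(bucket.name, bucket.location)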
Normal file I/O does not know how to deal with GCS gs:// paths correctly. You need something like:
from tensorflow.python.lib.io import file_io  # TensorFlow's GCS-aware file I/O

first_data_file = args.train_files[0]
file_stream = file_io.FileIO(first_data_file, mode='r')
# run experiment
model.run_experiment(file_stream)
But ironically, you can move files from the gs:// bucket to your root directory, which your program CAN then actually see:
import matplotlib.pyplot as plt

# presentation_mplstyle_path holds the gs:// URI of the style file
with file_io.FileIO(presentation_mplstyle_path, mode='r') as input_f:
    with file_io.FileIO('presentation.mplstyle', mode='w+') as output_f:
        output_f.write(input_f.read())
plt.style.use(['./presentation.mplstyle'])
And finally, moving a file from your root directory back to a gs:// bucket:
with file_io.FileIO(report_name, mode='r') as input_f:
    with file_io.FileIO(job_dir + '/' + report_name, mode='w+') as output_f:
        output_f.write(input_f.read())
Should be easier IMO.

"xml.sax._exceptions.SAXReaderNotAvailable: No parsers found" when run in jenkins

So I'm working towards having automated staging deployments via Jenkins and Ansible. Part of this is using a script called ec2.py from ansible in order to dynamically retrieve a list of matching servers to deploy to.
When I SSH into the Jenkins server and run the script as the jenkins user, it runs as expected. However, running the script from within Jenkins leads to the following error:
ERROR: Inventory script (ec2/ec2.py) had an execution error: Traceback (most recent call last):
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 1262, in <module>
Ec2Inventory()
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 159, in __init__
self.do_api_calls_update_cache()
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 386, in do_api_calls_update_cache
self.get_instances_by_region(region)
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 417, in get_instances_by_region
reservations.extend(conn.get_all_instances(filters = { filter_key : filter_values }))
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/ec2/connection.py", line 585, in get_all_instances
max_results=max_results)
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/ec2/connection.py", line 681, in get_all_reservations
[('item', Reservation)], verb='POST')
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/connection.py", line 1181, in get_list
xml.sax.parseString(body, h)
File "/usr/lib/python2.7/xml/sax/__init__.py", line 43, in parseString
parser = make_parser()
File "/usr/lib/python2.7/xml/sax/__init__.py", line 93, in make_parser
raise SAXReaderNotAvailable("No parsers found", None)
xml.sax._exceptions.SAXReaderNotAvailable: No parsers found
I don't know too much about Python, so I'm not sure how to debug this issue further.
So it turns out the issue was due to Jenkins overwriting the default LD_LIBRARY_PATH variable. By unsetting that variable before running Python, I was able to make the Python app work!
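As a quick sanity check (a minimal sketch, not part of ec2.py itself), you can run a small Python snippet from the Jenkins job with and without LD_LIBRARY_PATH set to confirm whether a SAX parser backend can actually be loaded:

import os
import xml.sax

# Show what the process sees; Jenkins was injecting a value that broke expat.
print('LD_LIBRARY_PATH =', os.environ.get('LD_LIBRARY_PATH'))

# Raises xml.sax._exceptions.SAXReaderNotAvailable("No parsers found")
# when the expat extension module cannot be loaded.
parser = xml.sax.make_parser()
print('SAX parser OK:', parser)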