Run Spark SQL query in AWS EMR

I set up an AWS EMR cluster.
I selected emr-6.0.0.
The application selected was:
Spark: Spark 2.4.4 on Hadoop 3.2.1 YARN with Ganglia 3.7.2 and Zeppelin 0.9.0-SNAPSHOT
After that I created a Jupyter notebook and attached it to the cluster.
The problem is that the following lines of code in the notebook throw an error:
data_frame = spark.read.json("s3://transactions-bucket-demo/")
data_frame.createOrReplaceTempView("table")
spark.sql("SELECT * from table")
Error:
'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
How can I resolve this error, which is raised by the SQL query in the notebook?

Related

How to run SHOW PARTITIONS on hive table using pyspark?

I am trying to run SHOW PARTITIONS on a Hive table using PySpark, but it fails with the error below. I am using a Dataproc cluster on GCP to run the PySpark job.
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
Traceback (most recent call last):
File "/tmp/test/pyspark_test.py", line 64, in <module>
sqlContext.sql(
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in sql
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Table not found for 'SHOW PARTITIONS': dbname.tbname; line 1 pos 0;
'ShowPartitions
+- 'UnresolvedTable [dbname, tbname], SHOW PARTITIONS
I tried the two approaches below, but neither works; both give the UnresolvedTable error. Please help.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().appName("test").getOrCreate()
spark.sql("""SHOW PARTITIONS dbname.tbname""")
and
from pyspark.sql import SparkSession, HiveContext
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)
sqlContext.sql("""SHOW PARTITIONS dbname.tbname""")
After searching a bit, it seems the job is unable to find hive-site.xml (not sure!). If that's the case, how do I pass this file to the job?

Airflow EmrCreateJobFlowOperator `label is invalid: emr-6.8.0` Error On Latest EMR Version

EMR released a new cluster version today.
But when I attempt to upgrade to the latest released EMR version using the contributed EMR create job flow operator, I'm hitting:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/airflow/dags/plugins/operators/shippo_emr_operators.py", line 133, in execute
return super().execute(context)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/emr_create_job_flow.py", line 81, in execute
response = emr.create_job_flow(job_flow_overrides)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/emr.py", line 88, in create_job_flow
response = self.get_conn().run_job_flow(**config)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the RunJobFlow operation: The supplied release label is invalid: emr-6.8.0.
Looking at the EMR contribution code, I don't see any hard-coded values, so I'm not sure why we're hitting this error at this point. Has the label format changed, and if so, where can I find the exact string?
EDIT: The plot thickens. If I run aws emr list-release-labels I get
NextToken: AAIAAdZ_6MGjAhReZYcOrXICLpYU98iQO_ZB3kCK65qEWRH9MrJLdi_r-alVGb1AZlnFg0vsdxRUzdBLt-SyQ3TznUBM8Ncu7n94pJVQykbWe_TapxBi2WpUkcZfRAcxYgcg6TwejeaxGKcbysA89Jc9M3vIlVQetGgY1zQESS2Dq3P9vxvsOo3xxZoTqnmOVjs24Hy1hPM8zfzoUfH7MMomXkqhU5MHZ0cG3Aee5F51LtNS0_NBge399SiDYwhz1W2RB2tAjDc=
ReleaseLabels:
- emr-6.7.0
- emr-6.6.0
- emr-6.5.0
- emr-6.4.0
Which indicates that the release label has been updated in the docs but not actually released to the tooling?
EMR releases new versions in a few regions first, so you are probably trying to launch a cluster in a region where the release is not available yet.
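One way to check this before creating the cluster is to ask the EMR API which release labels it exposes in the target region. The sketch below assumes a boto3 EMR client, whose list_release_labels call corresponds to the aws emr list-release-labels command shown in the question. The helper names available_release_labels and label_is_available are hypothetical and are not part of boto3 or Airflow.

```python
def available_release_labels(emr_client):
    """Collect every release label the client reports, following NextToken pages."""
    labels = []
    kwargs = {}
    while True:
        page = emr_client.list_release_labels(**kwargs)
        labels.extend(page.get("ReleaseLabels", []))
        token = page.get("NextToken")
        if not token:
            return labels
        # Ask for the next page on the following iteration
        kwargs = {"NextToken": token}


def label_is_available(emr_client, label):
    """Return True if the label is offered in the region the client points at."""
    return label in available_release_labels(emr_client)


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# the region name is only an illustration):
#   import boto3
#   emr = boto3.client("emr", region_name="us-west-2")
#   print(label_is_available(emr, "emr-6.8.0"))
```

Running the check against the same region the operator's connection uses should show whether the label is simply not offered there yet.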

Cloud composer issue with datasets in Australia region

I was trying to use Cloud Composer to schedule and orchestrate BigQuery jobs. The BigQuery tables are in the australia-southeast1 region. The Cloud Composer environment was created in the us-central1 region (as Composer is not available in the Australia region). When I try the command below, it throws a vague error. The same setup worked fine when I tried it with datasets residing in the EU and US.
Command:
gcloud beta composer environments run bq-schedule --location us-central1 test -- my_bigquery_dag input_gl 8-02-2018
Error:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/usr/local/lib/airflow/airflow/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/airflow/airflow/bin/cli.py", line 528, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/airflow/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models.py", line 1583, in run
session=session)
File "/usr/local/lib/airflow/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models.py", line 1492, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/contrib/operators/bigquery_operator.py", line 98, in execute
self.create_disposition, self.query_params)
File "/usr/local/lib/airflow/airflow/contrib/hooks/bigquery_hook.py", line 499, in run_query
return self.run_with_configuration(configuration)
File "/usr/local/lib/airflow/airflow/contrib/hooks/bigquery_hook.py", line 868, in run_with_configuration
err.resp.status)
Exception: ('BigQuery job status check failed. Final error was: %s', 404)
Is there any workaround to resolve this issue?
Because your dataset resides in australia-southeast1, BigQuery created the job in the same location by default. However, Airflow in your Composer environment was trying to get the job's status without specifying the location field.
Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
This has been fixed by my PR and it has been merged to master.
To work around this, you can extend the BigQueryCursor and override the run_with_configuration() function with location support.
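The core of that workaround is passing location when polling the job. Below is a minimal sketch of just that call, assuming a google-api-python-client style BigQuery service object like the one the hook wraps. The function get_job_state is a hypothetical helper, not part of Airflow, and is not the code from the PR mentioned above.

```python
def get_job_state(service, project_id, job_id, location=None):
    """Return the state of a BigQuery job, passing its location to jobs.get when given."""
    params = {"projectId": project_id, "jobId": job_id}
    if location:
        # Without this, a job created in a region such as australia-southeast1
        # may not be found, which matches the 404 in the traceback above.
        params["location"] = location
    job = service.jobs().get(**params).execute()
    return job["status"]["state"]
```

An overridden run_with_configuration() could call a helper like this in its polling loop, with the location taken from wherever the job was created.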

Jenkins Job Builder Configuration in Python

While updating jobs with Jenkins Job Builder using jenkins-jobs update, I'm getting the error below.
INFO:root:Updating jobs in ['jobs'] ([])
Traceback (most recent call last):
File "/usr/bin/jenkins-jobs", line 10, in <module>
sys.exit(main())
File "/usr/lib/python2.7/site-packages/jenkins_jobs/cmd.py", line 191, in main
execute(options, config)
File "/usr/lib/python2.7/site-packages/jenkins_jobs/cmd.py", line 372, in execute
n_workers=options.n_workers)
File "/usr/lib/python2.7/site-packages/jenkins_jobs/builder.py", line 348, in update_jobs
self.load_files(input_fn)
File "/usr/lib/python2.7/site-packages/jenkins_jobs/builder.py", line 293, in load_files
self.parser.parse(in_file)
File "/usr/lib/python2.7/site-packages/jenkins_jobs/parser.py", line 128, in parse
self.parse_fp(fp)
File "/usr/lib/python2.7/site-packages/jenkins_jobs/parser.py", line 105, in parse_fp
cls, dfn = next(iter(item.items()))
AttributeError: 'str' object has no attribute 'items'
Job-Builder Version : 1.6.1
Python Version : 2.7
OS : RHEL 7.1
I've tried this on different machines, but with no luck.
The AttributeError: 'str' object has no attribute 'items' error is very common in Python. It would be more helpful if you shared the code, or at least where the error appears.
You are using Jenkins Job Builder to configure Jenkins, and you are getting the error while updating Jenkins jobs. The update command is used to deploy a job to Jenkins after you have tested the job definition. The update command requires a configuration file.
You should pass that configuration file as it is, not as a string, and the jobs inside the configuration file should also not be strings, that is, not wrapped in single (') or double (") quotes.
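The traceback shows the parser calling item.items() on each top-level entry of the loaded file, so any entry that was parsed as a plain string rather than a mapping would raise this error. The following is a small sketch of a pre-check, assuming the YAML has already been loaded into a Python list. The helper non_mapping_entries and the sample job name are hypothetical and are not part of Jenkins Job Builder.

```python
def non_mapping_entries(entries):
    """Return (index, value) pairs for top-level entries that are not mappings."""
    return [(i, e) for i, e in enumerate(entries) if not isinstance(e, dict)]


# What a loaded job file might look like: the first entry was written unquoted
# and parsed as a mapping, the second was wrapped in quotes and parsed as a string.
parsed = [
    {"job": {"name": "build-api"}},  # mapping: has .items()
    "job: {name: build-api}",        # string: no .items()
]

for index, entry in non_mapping_entries(parsed):
    print(f"entry {index} is a {type(entry).__name__}, expected a mapping: {entry!r}")
```

Any entry this reports is a likely candidate for the quoting problem described above.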

"xml.sax._exceptions.SAXReaderNotAvailable: No parsers found" when run in jenkins

So I'm working towards having automated staging deployments via Jenkins and Ansible. Part of this is using a script called ec2.py from ansible in order to dynamically retrieve a list of matching servers to deploy to.
SSH-ing into the Jenkins server and running the script from the jenkins user, the script runs as expected. However, running the script from within jenkins leads to the following error:
ERROR: Inventory script (ec2/ec2.py) had an execution error: Traceback (most recent call last):
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 1262, in <module>
Ec2Inventory()
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 159, in __init__
self.do_api_calls_update_cache()
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 386, in do_api_calls_update_cache
self.get_instances_by_region(region)
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 417, in get_instances_by_region
reservations.extend(conn.get_all_instances(filters = { filter_key : filter_values }))
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/ec2/connection.py", line 585, in get_all_instances
max_results=max_results)
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/ec2/connection.py", line 681, in get_all_reservations
[('item', Reservation)], verb='POST')
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/connection.py", line 1181, in get_list
xml.sax.parseString(body, h)
File "/usr/lib/python2.7/xml/sax/__init__.py", line 43, in parseString
parser = make_parser()
File "/usr/lib/python2.7/xml/sax/__init__.py", line 93, in make_parser
raise SAXReaderNotAvailable("No parsers found", None)
xml.sax._exceptions.SAXReaderNotAvailable: No parsers found
I don't know too much about Python, so I'm not sure how to debug this issue further.
So it turns out the issue was caused by Jenkins overwriting the default LD_LIBRARY_PATH variable. By unsetting that variable before running Python, I was able to make the Python app work!
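If the script is launched from another Python process rather than from a shell step, the same idea can be sketched by dropping the variable from the child's environment. This is a minimal sketch under that assumption: run_without_ld_library_path is a hypothetical helper, and the command it runs here only prints the variable to show that it is gone.

```python
import os
import subprocess
import sys


def run_without_ld_library_path(cmd):
    """Run cmd with a copy of the current environment that has LD_LIBRARY_PATH removed."""
    env = {k: v for k, v in os.environ.items() if k != "LD_LIBRARY_PATH"}
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Example command: a child Python process that reports the variable's value
    result = run_without_ld_library_path(
        [sys.executable, "-c", "import os; print(os.environ.get('LD_LIBRARY_PATH'))"]
    )
    print(result.stdout.strip())
```

In a Jenkins shell build step, running unset LD_LIBRARY_PATH before invoking the script has the same effect.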