How to run SHOW PARTITIONS on hive table using pyspark? - google-cloud-platform

I am trying to run SHOW PARTITIONS on hive table using pyspark, but it is failing with the below error. I am using dataproc cluster on GCP to run pyspark job.
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
Traceback (most recent call last):
File "/tmp/test/pyspark_test.py", line 64, in <module>
sqlContext.sql(
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in sql
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Table not found for 'SHOW PARTITIONS': dbname.tbname; line 1 pos 0;
'ShowPartitions
+- 'UnresolvedTable [dbname, tbname], SHOW PARTITIONS
I tried below two approaches, but none of the approach works and gives UnresolvedTable error. Please help
spark = SparkSession.builder.enableHiveSupport().appName("test").getOrCreate()
spark.sql("""SHOW PARTITIONS dbname.tbname""")
and
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)
sqlContext.sql("""SHOW PARTITIONS dbname.tbname""")
After searching a bit, it seems job is unable to find hive-site.xml (not sure!). If thats the case, how to pass this file to the job?

Related

Run spark sql query in AWS EMR

I set up an AWS EMR cluster.
I selected emr-6.0.0.
The application selected was:
Spark: Spark 2.4.4 on Hadoop 3.2.1 YARN with Ganglia 3.7.2 and Zeppelin 0.9.0-SNAPSHOT
After that i created a jupyter notebook and attached it to the cluster.
The problem is that the following lines of code in the notebook throw an error :
data_frame = spark.read.json("s3://transactions-bucket-demo/")
data_frame.createOrReplaceTempView("table")
spark.sql("SELECT * from table")
Error:
'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
How to resolve this error due to sql query in the notebook?

Cloud composer issue with datasets in Australia region

I was trying to use cloud composer to schedule and orchestrate Bigquery jobs. Bigquery tables are in australia-southeast1 region.The cloud composer environment was created in us-central1 region(As composer is not available in Australia region). When I try below command , it throws a vague error. The same setup worked fine when I tried with datasets residing in EU and US.
Command:
gcloud beta composer environments run bq-schedule --location us-central1 test -- my_bigquery_dag input_gl 8-02-2018
Error:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/usr/local/lib/airflow/airflow/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/airflow/airflow/bin/cli.py", line 528, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/airflow/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models.py", line 1583, in run
session=session)
File "/usr/local/lib/airflow/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/airflow/airflow/models.py", line 1492, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/contrib/operators/bigquery_operator.py", line 98, in execute
self.create_disposition, self.query_params)
File "/usr/local/lib/airflow/airflow/contrib/hooks/bigquery_hook.py", line 499, in run_query
return self.run_with_configuration(configuration)
File "/usr/local/lib/airflow/airflow/contrib/hooks/bigquery_hook.py", line 868, in run_with_configuration
err.resp.status)
Exception: ('BigQuery job status check failed. Final error was: %s', 404)
Is there any workaround to resolve this issue?
Because your dataset resides in australia-southeast1, BigQuery created a job in the same location by default, which is australia-southeast1. However, the Airflow in your Composer environment was trying to get the job's status without specifying location field.
Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
This has been fixed by my PR and it has been merged to master.
To work around this, you can extend the BigQueryCursor and override the run_with_configuration() function with location support.

Unable to debug airflow error.Trying to use airflow to build data pipeline on GCP

I am trying to identify what might be causing the below issue (Airflow)?
Basically I have written a Test DAG and it's main task is to read data from BigQuery and write it into a new table.I tried searching about this but I am not able to find out what might be causing this. I'm not even sure if my gcp_connection is working correctly. I don't know how to test this.
Any help is greatly appreciated!
Traceback (most recent call last):
File "/anaconda3/lib/python3.6/site-packages/airflow/models.py", line 1659, in _run_raw_task
result = task_copy.execute(context=context)
File "/anaconda3/lib/python3.6/site-packages/airflow/operators/subdag_operator.py", line 103, in execute
executor=self.executor)
File "/anaconda3/lib/python3.6/site-packages/airflow/models.py", line 4214, in run
job.run()
File "/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 203, in run
self._execute()
File "/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 74, in wrapper
return func(*args, **kwargs)
File "/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 2547, in _execute
raise AirflowException(err)
airflow.exceptions.AirflowException: ---------------------------------------------------
Some task instances failed:
{('test_oscope.test_oscope', 'create_if_not_exists', datetime.datetime(2016, 6, 1, 0, 0, tzinfo=<Timezone [UTC]>), 1), ('test_oscope.test_oscope', 'fill', datetime.datetime(2016, 6, 1, 0, 0, tzinfo=<Timezone [UTC]>), 1)}
Hooks in Airflow are very useful on their own to use in interactive environments like iPython or Jupyter Notebook.
For example:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
GCSHook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_conn_id')
GCSHook.get_conn() # This will check if your GCP connection is working correctly or not

'No such file or directory' error after submitting a training job

I execute:
gcloud beta ml jobs submit training ${JOB_NAME} --config config.yaml
and after about 5 minutes the job errors out with this error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main "__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 232, in <module> tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 228, in main run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 129, in run_training data_sets = input_data.read_data_sets(FLAGS.train_dir, FLAGS.fake_data)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 212, in read_data_sets with open(local_file, 'rb') as f: IOError: [Errno 2] No such file or directory: 'gs://my-bucket/mnist/train/train-images.gz'
The strange thing is, as far as I can tell, that file exists at that url.
This error usually indicates you are using a multi-region GCS bucket for your output. To avoid this error you should use a regional GCS bucket. Regional buckets provide stronger consistency guarantees which are needed to avoid these types of errors.
For more information about properly setting up GCS buckets for Cloud ML please refer to the Cloud ML Docs
Normal IO does not know how to deal with GCS gs:// correctly. You need:
first_data_file = args.train_files[0]
file_stream = file_io.FileIO(first_data_file, mode='r')
# run experiment
model.run_experiment(file_stream)
But ironically, you can move files from the gs://bucket to your root directory, which your programs CAN then actually see:
with file_io.FileIO(gs://presentation_mplstyle_path, mode='r') as input_f:
with file_io.FileIO('presentation.mplstyle', mode='w+') as output_f:
output_f.write(input_f.read())
mpl.pyplot.style.use(['./presentation.mplstyle'])
And finally, moving a file from your root back to a gs://bucket:
with file_io.FileIO(report_name, mode='r') as input_f:
with file_io.FileIO(job_dir + '/' + report_name, mode='w+') as output_f:
output_f.write(input_f.read())
Should be easier IMO.

NotSupportedError when trying to build primary index in N1QL in Couchbase Python SDK

I'm trying to get into the new N1QL Queries for Couchbase in Python.
I got my database set up in Couchbase 4.0.0.
My initial try was to retreive all documents like this:
from couchbase.bucket import Bucket
bucket = Bucket('couchbase://localhost/dafault')
rv = bucket.n1ql_query('CREATE PRIMARY INDEX ON default').execute()
for row in bucket.n1ql_query('SELECT * FROM default'):
print row
But this produces a OperationNotSupportedError:
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 2357, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1777, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/my_user/python_tests/test_n1ql.py", line 9, in <module>
rv = bucket.n1ql_query('CREATE PRIMARY INDEX ON default').execute()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 215, in execute
for _ in self:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 235, in __iter__
self._start()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 180, in _start
self._mres = self._parent._n1ql_query(self._params.encoded)
couchbase.exceptions.NotSupportedError: <RC=0x13[Operation not supported], Couldn't schedule n1ql query, C Source=(src/n1ql.c,82)>
Here the version numbers of everything I use:
Couchbase Server: 4.0.0
couchbase python library: 2.0.2
cbc: 2.5.1
python: 2.7.8
gcc: 4.2.1
Anyone an idea what might have went wrong here? I could not find any solution to this problem up to now.
There was another ticket for node.js where the same issue happened. There was a proposal to enable n1ql for the specific bucket first. Is this also needed in python?
It would seem you didn't configure any cluster nodes with the Query or Index services. As such, the error returned is one that indicates no nodes are available.
I also got similar error while trying to create primary index.
Create a primary index...
Traceback (most recent call last):
File "post-upgrade-test.py", line 45, in <module>
mgr.n1ql_index_create_primary(ignore_exists=True)
File "/usr/local/lib/python2.7/dist-packages/couchbase/bucketmanager.py", line 428, in n1ql_index_create_primary
'', defer=defer, primary=True, ignore_exists=ignore_exists)
File "/usr/local/lib/python2.7/dist-packages/couchbase/bucketmanager.py", line 412, in n1ql_index_create
return IxmgmtRequest(self._cb, 'create', info, **options).execute()
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 160, in execute
return [x for x in self]
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 144, in __iter__
self._start()
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 132, in _start
self._cmd, index_to_rawjson(self._index), **self._options)
couchbase.exceptions.NotSupportedError: <RC=0x13[Operation not supported], Couldn't schedule ixmgmt operation, C Source=(src/ixmgmt.c,98)>
Adding query and index node to the cluster solved the issue.