Submit a pyspark job with a config file on Dataproc - google-cloud-platform

I'm a newbie on GCP and I'm struggling with submitting a PySpark job on Dataproc.
I have a Python script that depends on a config.yaml file, and I notice that when I submit the job everything is executed under /tmp/.
How can I make that config file available in the /tmp/ folder?
At the moment, I get this error:
12/22/2020 10:12:27 AM root INFO Read config file.
Traceback (most recent call last):
File "/tmp/job-test4/train.py", line 252, in <module>
run_training(args)
File "/tmp/job-test4/train.py", line 205, in run_training
with open(args.configfile, "r") as cf:
FileNotFoundError: [Errno 2] No such file or directory: 'gs://network-spark-migrate/model/demo-config.yml'
Thanks in advance

Below is a snippet that worked for me:
gcloud dataproc jobs submit pyspark gs://network-spark-migrate/model/train.py --cluster train-spark-demo --region europe-west6 --files=gs://network-spark-migrate/model/demo-config.yml -- --configfile ./demo-config.yml
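For completeness, here is a minimal sketch of how the driver script might read the staged config (an assumption on my part: it uses argparse and PyYAML, with an argument named --configfile matching the submit command above). Files listed in --files are staged into the job's working directory on the cluster, so a relative path like ./demo-config.yml can be opened directly, whereas open() cannot read a gs:// URI:

# train.py (sketch) - read a config file staged next to the driver via --files
import argparse

import yaml  # assumes PyYAML is available on the cluster


def parse_args():
    parser = argparse.ArgumentParser()
    # --configfile receives "./demo-config.yml", the local copy staged by --files,
    # not the original gs:// URI.
    parser.add_argument("--configfile", required=True)
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    with open(args.configfile, "r") as cf:
        config = yaml.safe_load(cf)
    print(config)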

Related

Permission denied: '/models/default'

I followed this page:
https://cloud.google.com/vertex-ai/docs/export/export-model-tabular
I trained the model in the Google Cloud Platform console, then exported it per the instructions. However, when I run the docker run command I get the following:
sudo docker run -v /home/grgcp8787/feature-model/model-2101347040586891264/tf-saved-model/emotion-feature:/models/default -p 8080:8080 -it europe-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server-v1:latest
Unable to find image 'europe-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server-v1:latest' locally
Trying to pull repository europe-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server-v1 ...
latest: Pulling from europe-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server-v1
Digest: sha256:6b2ac764102278efa467daccc80acbcdcf119e3d7be079225a6d69ba5be7e8c5
Status: Downloaded newer image for europe-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server-v1:latest
ERROR:root:Caught exception: [Errno 13] Permission denied: '/models/default'
Traceback (most recent call last):
File "/google3/third_party/py/cloud_ml_autoflow/prediction/launcher.py", line 311, in main
if is_ucaip_model():
File "/google3/third_party/py/cloud_ml_autoflow/prediction/launcher.py", line 279, in is_ucaip_model
contents = os.listdir(local_model_artifact_path())
PermissionError: [Errno 13] Permission denied: '/models/default'
Could it be that my Google Cloud CLI is not initialized?
I am not sure what I did wrong, or what I need to change to fix it.
Thank you in advance for your help.
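Not from the original question, but since the error is a plain [Errno 13] from os.listdir inside the container, one thing worth checking (an assumption on my part) is whether the mounted host directory is readable and traversable by the non-root user the prediction server runs as. A quick local sanity check in Python, using the host path from the docker run command above:

# permission_check.py - hypothetical sanity check, not part of the original question.
# Verifies that the host directory mounted at /models/default is world-readable,
# since the container process may not run as your host user.
import os
import stat

host_path = "/home/grgcp8787/feature-model/model-2101347040586891264/tf-saved-model/emotion-feature"

mode = os.stat(host_path).st_mode
print("world-readable:", bool(mode & stat.S_IROTH))
print("world-executable (needed to enter the dir):", bool(mode & stat.S_IXOTH))
print("contents:", os.listdir(host_path))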

In AWS EMR Jupyter Notebook, how to change the user from livy to hadoop

I have created an AWS EMR cluster and uploaded
sparkify_log_small.json
And created an EMR Jupyter notebook with the code below, thinking it would read from the (hadoop) user's home directory.
sparkify_log_data = "sparkify_log_small.json"
df = spark.read.json(sparkify_log_data)
df.persist()
df.head(5)
But when I submit the code, I get the error below.
'Path does not exist: hdfs://ip-172-31-50-58.us-west-2.compute.internal:8020/user/livy/sparkify_log_small.json;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Path does not exist: hdfs://ip-172-31-50-58.us-west-2.compute.internal:8020/user/livy/sparkify_log_small.json;'
From googling I got to know that the default notebook user is livy. How can I change the user in the Jupyter notebook from livy to hadoop, or point Spark to the right directory?
I have tried creating a folder and copying the file from /home/hadoop/sparkify_log_small.json to /home/livy/sparkify_log_small.json,
but that did not work.
Basically, I am trying to read a file on the EC2 master node from the notebook.
The procedure below resolved it:
1. Checked the Hadoop files:
hadoop fs -ls
2. Created a folder in the Hadoop filesystem:
hdfs dfs -mkdir /home
hdfs dfs -mkdir /home/hadoop
3. Copied the file to that location:
hadoop fs -put ./sparkify_log_small.json /home/hadoop/sparkify_log_small.json
Then I ran the Python code in the Jupyter cell. It worked.
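As a sketch (my own addition, not from the answer above), the notebook cell can also point at the HDFS location explicitly, which avoids any dependency on which user's home directory Spark resolves relative paths against:

# Read the file with an explicit HDFS path instead of a relative one,
# so it does not resolve against /user/livy.
sparkify_log_data = "hdfs:///home/hadoop/sparkify_log_small.json"

df = spark.read.json(sparkify_log_data)
df.persist()
df.head(5)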

"ImportError: No module named idlelib" when running Google Dataflow worker

I have a Python 2.7 script that I run locally to launch an Apache Beam / Google Dataflow job (SDK 2.12.0). The job takes a CSV file from a Google Cloud Storage bucket, processes it, and then creates an entity in Google Datastore for each row. The script ran fine for years, but now it is failing:
INFO:root:2019-05-15T22:07:11.481Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO:root:2019-05-15T21:47:13.370Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 773, in run
self._load_main_session(self.local_staging_directory)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 489, in _load_main_session
pickler.load_session(session_file)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 280, in load_session
return dill.load_session(file_path)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 410, in load_session
module = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 827, in _import_module
return __import__(import_name)
ImportError: No module named idlelib
I believe this error is happening at the worker level (not locally). I don't reference idlelib in my script. To make sure it wasn't me, I installed updates for all google-cloud packages, apache-beam[gcp], etc. locally, just in case. I also tried importing idlelib in my script and got the same error. Any suggestions?
It had been fine for years and started failing with the SDK 2.12.0 release.
What was the last release that this script succeeded on? 2.11?
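Not an answer from the thread, but for context: the traceback shows the worker failing inside _load_main_session, i.e. while unpickling the launcher's main session on the worker, and that pickled state apparently drags in idlelib. If the job is launched with save_main_session enabled, one hedged experiment (assuming the DoFns don't rely on globals from the main module) is to turn it off:

# Hypothetical launcher snippet - illustrates only the save_main_session option;
# the original script is not shown in the question.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
# With save_main_session=False the main module's globals (and whatever they
# import, e.g. idlelib) are not pickled and shipped to the workers.
options.view_as(SetupOptions).save_main_session = False

with beam.Pipeline(options=options) as p:
    (p | beam.Create(["check"]) | beam.Map(lambda x: x))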

boto3 throws an error when packaged under rpm

I am using boto3 in my project, and when I package it as an RPM it raises an error while initializing the EC2 client.
<class 'botocore.exceptions.DataNotFoundError'>:Unable to load data for: _endpoints. Traceback -Traceback (most recent call last):
File "roboClientLib/boto/awsDRLib.py", line 186, in _get_ec2_client
File "boto3/__init__.py", line 79, in client
File "boto3/session.py", line 200, in client
File "botocore/session.py", line 789, in create_client
File "botocore/session.py", line 682, in get_component
File "botocore/session.py", line 809, in get_component
File "botocore/session.py", line 179, in <lambda>
File "botocore/session.py", line 475, in get_data
File "botocore/loaders.py", line 119, in _wrapper
File "botocore/loaders.py", line 377, in load_data
DataNotFoundError: Unable to load data for: _endpoints
Can anyone help me here? Probably boto3 requires some runtime resolution that it is not able to do inside the RPM.
I tried setting LD_LIBRARY_PATH in /etc/environment, which is not working:
export LD_LIBRARY_PATH="/usr/lib/python2.6/site-packages/boto3:/usr/lib/python2.6/site-packages/boto3-1.2.3.dist-info:/usr/lib/python2.6/site-packages/botocore:
I faced the same issue:
botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-04-01/service-2
I figured out that the data directory was missing. Updating botocore by running the following solved my issue:
pip install --upgrade botocore
Botocore depends on a set of service definition files that it uses to generate clients on the fly. Boto3 further depends on another set of files that it uses to generate resource clients. You will need to include these in any installs of boto3 or botocore. The files will need to be located in the 'data' folder of the root of the respective library.
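A small sketch (my own addition, not from the answer) for checking whether those data files actually made it into the packaged install, and where botocore is looking for them:

# debug_botocore_data.py - hypothetical check that the bundled service definition
# files were included in the RPM's site-packages.
import os

import botocore
import botocore.loaders

data_dir = os.path.join(os.path.dirname(botocore.__file__), "data")
print("data dir exists:", os.path.isdir(data_dir))
if os.path.isdir(data_dir):
    print("entries:", sorted(os.listdir(data_dir))[:10])

# The loader reports the directories it searches for service definitions.
loader = botocore.loaders.create_loader()
print("search paths:", loader.search_paths)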
I faced a similar issue, which was due to an old version of botocore. Once I updated it, it started working.
Please consider using the command below:
pip install --upgrade botocore
Also, please ensure you have set up a boto configuration profile.
Boto searches for credentials in the following order (a sketch of passing them explicitly follows the list):
Passing credentials as parameters in the boto.client() method
Passing credentials as parameters when creating a Session object
Environment variables
Shared credential file (~/.aws/credentials)
AWS config file (~/.aws/config)
Assume Role provider
Boto2 config file (/etc/boto.cfg and ~/.boto)
Instance metadata service on an Amazon EC2 instance that has an IAM role configured.
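Here is a minimal sketch of the first option, passing credentials explicitly when creating the client (the region and key values are placeholders, not from the question):

# Hypothetical example of passing credentials directly to the client,
# rather than relying on environment variables or config files.
import boto3

ec2 = boto3.client(
    "ec2",
    region_name="us-east-1",          # placeholder region
    aws_access_key_id="AKIA...",      # placeholder key id
    aws_secret_access_key="...",      # placeholder secret
)
print(ec2.describe_regions()["Regions"][:1])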

Unable to launch Spark cluster on AWS using spark-ec2 script - "AWS was not able to validate the provided access credentials"

I have tried both of the commands below and set the environment variables prior to launching the script, but I am hit with an "AWS was not able to validate the provided access credentials" error. I don't think there is an issue with the keys.
I would appreciate any sort of help to fix this.
I am on an Ubuntu t2.micro instance.
https://spark.apache.org/docs/latest/ec2-scripts.html
export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESS_KEY_ID=
./spark-ec2 -k admin-key1 -i /home/ubuntu/admin-key1.pem -s 3 launch my-spark-cluster
./spark-ec2 --key-pair=admin-key1 --identity-file=/home/ubuntu/admin-key1.pem --region=ap-southeast-2 --zone=ap-southeast-2a launch my-spark-cluster
AuthFailure
AWS was not able to validate the provided access credentials
Traceback (most recent call last):
File "./spark_ec2.py", line 1465, in <module>
main()
File "./spark_ec2.py", line 1457, in main
real_main()
File "./spark_ec2.py", line 1277, in real_main
opts.zone = random.choice(conn.get_all_zones()).name
File "/cskmohan/spark-1.4.1/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 1759, in get_all_zones
[('item', Zone)], verb='POST')
File "/cskmohan/spark-1.4.1/ec2/lib/boto-2.34.0/boto/connection.py", line 1182, in get_list
raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 401 Unauthorized
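Not from the thread, but a quick way (my own suggestion, using boto3 rather than the boto 2.34 bundled with spark-ec2) to confirm that the exported keys themselves are accepted by AWS before blaming the script:

# credential_check.py - hypothetical sanity check that AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY from the environment are valid and can list zones.
import boto3

sts = boto3.client("sts", region_name="ap-southeast-2")
print(sts.get_caller_identity())

ec2 = boto3.client("ec2", region_name="ap-southeast-2")
zones = ec2.describe_availability_zones()["AvailabilityZones"]
print([z["ZoneName"] for z in zones])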