I am new to EMR. I have tried to parse a 500 GB file in Spark.
I set up a Spark cluster in EMR with 32 nodes, with Spark, Hadoop, and Zeppelin installed. The file is in S3. When I tried with a smaller file it worked well (same code and same S3 location), but with the 500 GB file I got the following error. Can someone help me?
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4268307265600611017.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4268307265600611017.py", line 360, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 3, in <module>
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1073, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1064, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 935, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 834, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-0-12-220.ec2.internal, executor 19): java.lang.RuntimeException: java.io.FileNotFoundException: /mnt1/yarn/usercache/zeppelin/filecache/13/__spark_libs__8980026775986932451.zip/hadoop-common-2.8.4-amzn-1.jar (No such file or directory)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2854)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2696)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2579)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1257)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1229)
    ... 30 more
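For context, this is a minimal sketch of the kind of read-and-count job the traceback points at (the S3 path, variable names, and context setup are assumptions; the actual notebook code is not shown in the question):

from pyspark import SparkContext

# In Zeppelin a SparkContext is normally provided as sc already.
sc = SparkContext.getOrCreate()

# Hypothetical S3 location; the real path is not given in the question.
rdd = sc.textFile("s3://my-bucket/path/to/500gb-file.txt")
print(rdd.count())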
I'm getting this error when running a Spark cluster with YARN using the AWS EMR service:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "process_ecommerce.py", line 131, in <module>
cfg["partitions"]["info"]
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/__pyfiles__/spark_utils.py", line 10, in save_dataframe
.parquet(path)
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 844, in parquet
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o343.parquet
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:36063)
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque
I'm running the cluster with 1 master and 20 slave nodes of type r5.2xlarge. Each of them has 8 CPUs and 64 GB of memory. The Spark configuration is:
20 GB executor memory
30 GB executor memory overhead
8 cores per executor
1 CPU per task
How can I solve this error?
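For reference, the configuration above corresponds roughly to the following spark-submit flags (a sketch: the flag names are standard Spark options, the script name is taken from the traceback, and the rest of the command is an assumption):

spark-submit --master yarn --deploy-mode cluster \
    --executor-memory 20g \
    --conf spark.executor.memoryOverhead=30g \
    --executor-cores 8 \
    --conf spark.task.cpus=1 \
    process_ecommerce.py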
I am working on a Python Spark project. Initially I wrote a script that loads a dataframe into Postgres for a particular client, which includes a utility function that performs the actual load:
df.rdd.repartition(self.__max_conn).foreachPartition(
lambda iterator: load_utils.load_tab_postgres(conn_prop=conn_prop,
tab_name=<tablename>,
iterator=iterator))
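For context, a minimal sketch of what such a per-partition loader might look like (purely illustrative: the real load_utils.load_tab_postgres is not shown, and psycopg2, the connection properties, and the row layout are assumptions):

import psycopg2

def load_tab_postgres(conn_prop, tab_name, iterator):
    # Insert all rows of one partition into the given Postgres table.
    rows = [tuple(row) for row in iterator]
    if not rows:
        return
    # Assumes conn_prop is a dict of psycopg2 connection keyword arguments.
    conn = psycopg2.connect(**conn_prop)
    try:
        cur = conn.cursor()
        placeholders = ", ".join(["%s"] * len(rows[0]))
        cur.executemany(
            "INSERT INTO " + tab_name + " VALUES (" + placeholders + ")", rows)
        conn.commit()
    finally:
        conn.close()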
Initially the entire code, including the snippet above and load_utils(), lived in a single module and worked perfectly fine.
Later I had to extract the common code, including load_utils, into a base module that could be used by different client modules. This is when the same code failed with the error below:
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 764, in foreachPartition
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 126.0 failed 4 times, most recent failure: Lost task 18.3 in stage 126.0 (TID 24028, tbsatad6r15g24.company.co.us, executor 242): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "build/bdist.linux-x86_64/egg/basemodule/__init__.py", line 12, in <module>
    import basemodule.entitymodule.base
  File "build/bdist.linux-x86_64/egg/basemodule/entitymodule/base.py", line 12, in <module>
  File "build/bdist.linux-x86_64/egg/basemodule/contexts.py", line 17, in <module>
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/java_gateway.py", line 48, in launch_gateway
    SPARK_HOME = os.environ["SPARK_HOME"]
  File "/dhcommon/dhpython/python/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'SPARK_HOME'
Below is the spark-submit command used to run the code:
spark-submit --master yarn --deploy-mode cluster \
    --driver-class-path postgresql-42.2.4.jre6.jar \
    --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.4.jar,postgresql-42.2.4.jre6.jar \
    --py-files project.egg driver_file.py
In both scenarios above, the load_utils file containing the "load_tab_postgres" method is bundled into project.egg.
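For completeness, a minimal sketch of the kind of setup.py that could produce project.egg (the package name mirrors the basemodule seen in the traceback; everything else is an assumption):

# setup.py -- build the egg with: python setup.py bdist_egg
from setuptools import setup, find_packages

setup(
    name='project',
    version='0.1',
    packages=find_packages(),  # picks up basemodule and its subpackages
)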
I hope somebody here can help. I've been googling this error like crazy but haven't found anything.
I have a pipeline that works perfectly when executed locally, but it fails when executed on GCP. These are the error messages I get.
Workflow failed. Causes: S03:Write transform fn/WriteMetadata/ResolveBeamFutures/CreateSingleton/Read+Write transform fn/WriteMetadata/ResolveBeamFutures/ResolveFutures/Do+Write transform fn/WriteMetadata/WriteMetadata failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
Traceback (most recent call last):
  File "preprocess.py", line 491, in <module>
    main()
  File "preprocess.py", line 487, in main
    transform_data(args,pipeline_options,runner)
  File "preprocess.py", line 451, in transform_data
    eval_data |= 'Identity eval' >> beam.ParDo(Identity())
  File "/Library/Python/2.7/site-packages/apache_beam/pipeline.py", line 335, in __exit__
    self.run().wait_until_finish()
  File "/Library/Python/2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 897, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 166, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10607)
    def start(self):
  File "apache_beam/runners/worker/operations.py", line 295, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10501)
    with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 300, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:9702)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
    obj = cls.__new__(cls, *args)
TypeError: __new__() takes exactly 4 arguments (1 given)
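For reference, the Identity transform referenced in the traceback could be as simple as the following (a hypothetical sketch; the original class definition is not shown in the question):

import apache_beam as beam

class Identity(beam.DoFn):
    # Pass every element through unchanged.
    def process(self, element):
        yield element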
Any ideas??
Thanks,
Pedro
If the pipeline works locally but fails on GCP, it's possible that you're running into a version mismatch.
Which TF, tf.Transform, and Beam versions are you running locally and on GCP?
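One way to compare them is to print the versions in each environment, for example from the same Python interpreter that runs the pipeline (a sketch; it assumes the usual __version__ attributes of these packages):

import sys
import tensorflow as tf
import tensorflow_transform as tft
import apache_beam as beam

print(sys.version)
print(tf.__version__)
print(tft.__version__)
print(beam.__version__)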
I am using Airflow 1.9 to launch a Dataflow job on Google Cloud Platform (GCP) via a DataFlowJavaOperator.
Below is the code used to launch the Dataflow job from an Airflow DAG:
df_dispatch_data = DataFlowJavaOperator(
task_id='df-dispatch-data', # Equivalent to JobName
jar="/path/of/my/dataflow/jar",
gcp_conn_id="my_connection_id",
dataflow_default_options={
'project': my_project_id,
'zone': 'europe-west1-b',
'region': 'europe-west1',
'stagingLocation': 'gs://my-bucket/staging',
'tempLocation': 'gs://my-bucket/temp'
},
options={
'workerMachineType': 'n1-standard-1',
'diskSizeGb': '50',
'numWorkers': '1',
'maxNumWorkers': '50',
'schemaBucket': 'schemas_needed_to_dispatch',
'autoscalingAlgorithm': 'THROUGHPUT_BASED',
'readQuery': 'my_query'
}
)
However, even though everything is fine on GCP (the job succeeds), an exception occurs at the end of the Dataflow job on my Airflow machine. It is thrown by gcp_dataflow_hook.py:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 528, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1584, in run
session=session)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1493, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/dataflow_operator.py", line 121, in execute
hook.start_java_dataflow(self.task_id, dataflow_options, self.jar)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 152, in start_java_dataflow
task_id, variables, dataflow, name, ["java", "-jar"])
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 146, in _start_dataflow
self.get_conn(), variables['project'], name).wait_for_done()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 31, in __init__
self._job = self._get_job()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 48, in _get_job
job = self._get_job_id_from_name()
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 40, in _get_job_id_from_name
for job in jobs['jobs']:
KeyError: 'jobs'
Do you have any idea?
This issue is caused by the options used to launch the Dataflow job. If --zone or --region is given, the Google API call used to fetch the job status does not find it; it only works with the default zone and region (US / us-central1).
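In other words, a sketch of the same operator with 'zone' and 'region' removed so that the job-status lookup uses the default region (all remaining values are the ones from the question):

from airflow.contrib.operators.dataflow_operator import DataFlowJavaOperator

df_dispatch_data = DataFlowJavaOperator(
    task_id='df-dispatch-data',
    jar="/path/of/my/dataflow/jar",
    gcp_conn_id="my_connection_id",
    dataflow_default_options={
        'project': my_project_id,
        # 'zone' and 'region' omitted: the hook then finds the job in the
        # default region (us-central1), per the explanation above.
        'stagingLocation': 'gs://my-bucket/staging',
        'tempLocation': 'gs://my-bucket/temp'
    },
    options={
        'workerMachineType': 'n1-standard-1',
        'diskSizeGb': '50',
        'numWorkers': '1',
        'maxNumWorkers': '50',
        'schemaBucket': 'schemas_needed_to_dispatch',
        'autoscalingAlgorithm': 'THROUGHPUT_BASED',
        'readQuery': 'my_query'
    }
)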
I am trying to connect to HBase 1.2.6 from Python code, as follows:
import happybase
connection = happybase.Connection(host='localhost',port=16010)
table = connection.table('blogpost')
table.put('1', {'post:title': 'hello world1'})
I manually created the table "blogpost" in HBase. I am using Python 2.7 and happybase 1.1.0.
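As an aside, the table could also have been created through happybase itself, which talks to the HBase Thrift server (a sketch; the column family name 'post' is taken from the put call above, and the port shown is happybase's default Thrift port):

import happybase

# happybase connects to the HBase Thrift server (default port 9090).
connection = happybase.Connection(host='localhost', port=9090)
connection.create_table('blogpost', {'post': dict()})  # one column family: 'post'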
The error log is as follows:
/usr/bin/python2.7 /home/spark/PycharmProjects/PySpark/hbase.py
Traceback (most recent call last):
File "/home/spark/PycharmProjects/PySpark/hbase.py", line 5, in <module>
table.put('1', {'post:title': 'hello world1'})
File "/usr/local/lib/python2.7/dist-packages/happybase/table.py", line 464, in put
batch.put(row, data)
File "/usr/local/lib/python2.7/dist-packages/happybase/batch.py", line 137, in __exit__
self.send()
File "/usr/local/lib/python2.7/dist-packages/happybase/batch.py", line 60, in send
self._table.connection.client.mutateRows(self._table.name, bms, {})
File "/usr/local/lib/python2.7/dist-packages/thriftpy/thrift.py", line 198, in _req
return self._recv(_api)
File "/usr/local/lib/python2.7/dist-packages/thriftpy/thrift.py", line 210, in _recv
fname, mtype, rseqid = self._iprot.read_message_begin()
File "thriftpy/protocol/cybin/cybin.pyx", line 439, in cybin.TCyBinaryProtocol.read_message_begin (thriftpy/protocol/cybin/cybin.c:6470)
cybin.ProtocolError: No protocol version header
Process finished with exit code 1

Thanks.