I am trying to cross join two data frames, apply a few transformations, and finally write the result to a temporary S3 location. But I always end up with the "No space left on device" error below. It looks like it is due to calling spill(). Could you please help me overcome this error with the correct configuration?
Configuration details:
Cluster: AWS EMR cluster
CORE nodes: 2 initially, scaling up to 15 nodes.
TASK nodes: 0 initially, scaling up to 15 on an ON-DEMAND basis.
Instance type: r4.2xlarge (8 cores, 61 GB RAM, 128 GB EBS)
Dataframe1 & Dataframe2 partitions: 26 partitions each.
Dataframe1 record count = 115580
Dataframe2 record count = 94191
Dataframe1 column count: 53 (1 column holding JSON data)
Dataframe2 column count: 36
spark.sql.shuffle.partitions: 500
"spark.executor.memoryOverhead": "4852"
"spark.driver.memoryOverhead": "4852"
Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 63 in stage 68.0 failed 4 times, most recent failure: Lost task 63.3 in stage 68.0 (TID 1640) (ip-10-66-199-71.ec2.internal executor 44):
org.apache.spark.memory.SparkOutOfMemoryError: error while calling spill() on org.apache.spark.shuffle.sort.ShuffleExternalSorter#7ea8a25 : No space left on device
Thanks in Advance..!!
Sekhar
It's a common issue, and AWS provides official documentation on how to resolve it:
How do I resolve "no space left on device" stage failures in Spark on Amazon EMR?
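Without repeating that whole article, the levers that usually matter for this error are (a) more local/EBS disk on the core and task nodes, since that is where the spill and shuffle files are written, and (b) Spark settings that reduce how much temporary data hits disk. A rough sketch with spark-submit (the values and the your_job.py script name are illustrative assumptions, not tuned for your workload):

spark-submit \
  --conf spark.shuffle.compress=true \
  --conf spark.shuffle.spill.compress=true \
  --conf spark.sql.shuffle.partitions=500 \
  --conf spark.default.parallelism=500 \
  your_job.py

Keep in mind that a cross join of ~115k rows by ~94k rows produces on the order of 10^10 intermediate rows, so some amount of spill is expected; disk capacity is usually the decisive factor.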
Related
I am having this error in PySpark (Amazon EMR); my file is about 2 GB. What can I do to change the allocation?
Thanks
I tried to increase the size of the cluster, but at some stages I still have the problem:
Py4JJavaError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext. :
java.lang.IllegalArgumentException: Required executor memory (8192),
overhead (1536 MB), and PySpark memory (0 MB) is above the max
threshold (5760 MB) of this cluster! Please check the values of
'yarn.scheduler.maximum-allocation-mb' and/or
'yarn.nodemanager.resource.memory-mb'.
When you submit your job to Apache Spark, you can add parameters to customize the memory settings, as in the example below.
These parameters override the default configuration.
Example
"--deploy-mode": "cluster",
"--num-executors": 60,
"--executor-memory": "16g",
"--executor-cores": 5,
"--driver-memory": "16g",
"--conf": {"spark.driver.maxResultSize": "2g"}
I need to run a batch job from GCS to BigQuery via Dataflow and Beam. All my files are Avro with the same schema.
I've created a Dataflow Java application that works on a smaller set of data (~1 GB, about 5 files).
But when I try to run it on a bigger set of data (>500 GB, >1000 files), I receive an error message:
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix 1b83679a4f5d48c5b45ff20b2b822728_6e48345728d4da6cb51353f0dc550c1b_00001_00000, reached max retries: 3, last failed load job: ...
After 3 retries it terminates with:
Workflow failed. Causes: S57....... A work item was attempted 4 times without success....
This step is the load to BigQuery.
Stack Driver says the processing is stuck in step ....for 10m00s... and
Request failed with code 409, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes.....
I looked up the 409 error code, which indicates that I might have an existing job, dataset, or table. I've removed all the tables and re-run the application, but it still shows the same error message.
I am currently limited to 65 workers, and I have them using n1-standard-4 CPUs.
I believe there are other ways to move the data from GCS to BigQuery, but I need to demonstrate Dataflow.
"java.lang.RuntimeException: Failed to create job with prefix beam_load_csvtobigqueryxxxxxxxxxxxxxx, reached max retries: 3, last failed job: null.
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:198)..... "
One possible cause is a permissions issue. Ensure the account that interacts with BigQuery has the "bigquery.jobs.create" permission, which is included in the predefined "BigQuery User" role.
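If the pipeline runs as a service account, granting that role is a one-line gcloud command; a sketch below, where PROJECT_ID and the service-account address are placeholders (roles/bigquery.user is the "BigQuery User" role and includes bigquery.jobs.create, as does the narrower roles/bigquery.jobUser):

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:my-dataflow-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.user"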
Posting the comment of @DeaconDesperado as community wiki: they experienced the same error, and removing the disallowed characters from the table name (names may contain only Unicode letters, marks, numbers, connectors, dashes, or spaces) made the error go away.
I got the same problem using "roles/bigquery.jobUser", "roles/bigquery.dataViewer", and "roles/bigquery.user". But only when granting "roles/bigquery.admin" did the issue get resolved.
I am trying to dump a list of docs to an AWS Elasticsearch instance. It was running fine. Then, all of a sudden, it started throwing this error:
{ _index: '<my index name>',
_type: 'type',
_id: 'record id',
status: 403,
error:
{ type: 'cluster_block_exception',
reason: 'blocked by: [FORBIDDEN/8/index write (api)];' } }
I checked the forums. Most of them say it is a JVM memory issue: if it goes above 92%, AWS will stop any writes to the cluster/index. However, when I checked the JVM memory, it shows less than 92%. Am I missing something here?
This error means the Amazon ES service is actively blocking writes to protect the cluster from reaching red or yellow status. It does this by setting index.blocks.write.
The two common reasons are:
Low Memory
When the JVMMemoryPressure metric exceeds 92% for 30 minutes, Amazon ES triggers a protection mechanism and blocks all write operations to prevent the cluster from reaching red status. When the protection is on, write operations fail with a ClusterBlockException error, new indexes can't be created, and the IndexCreateBlockException error is thrown.
When the JVMMemoryPressure metric returns to 88% or lower for five minutes, the protection is disabled, and write operations to the cluster are unblocked.
Low Disk Space
Elasticsearch has a default "low watermark" of 85%, meaning that once disk usage exceeds 85%, Elasticsearch no longer allocates shards to that node. Elasticsearch also has a default "high watermark" of 90%, at which point it attempts to relocate shards to other nodes.
This error indicates that AWS Elasticsearch has placed a block on your domain based upon disk space. At 85%, ES will not allow you to create any new indexes. At 90%, no new documents can be written.
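To see which of the two limits you are hitting, a couple of quick checks (sketched against localhost as in the other examples here; use your domain endpoint, and note that managed domains expose only a subset of APIs):

curl -XGET 'http://localhost:9200/_cat/allocation?v'                              # disk used per data node
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent'   # JVM heap pressure per node

On the AWS side, the same information is in the JVMMemoryPressure and FreeStorageSpace CloudWatch metrics for the domain.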
ES can apply a write block on an index during rollovers, or because of low disk space or memory.
To stop these errors, you need to remove the write block on the index by setting index.blocks.write to false:
curl -X PUT -H "Content-Type: application/json" \
'http://localhost:9200/{index_name}/_settings' \
-d '{ "index": { "blocks": { "write": "false" } } }'
The accepted solution was not enough in my case; I had to remove index.blocks.read_only_allow_delete as well:
PUT /my_index/_settings
{
"index.blocks.read_only_allow_delete": null,
"index.blocks.write": null
}
ES version 7.15
This can also happen if the index you're trying to write to has been marked as read-only. I've had it happen due to an Index State Management misconfiguration that caused a weekly index to be moved to a warm state after one day.
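A quick way to check for that, as a sketch (placeholder endpoint and index name; the wildcard on the settings path narrows the response to the block flags):

curl -XGET 'http://localhost:9200/my_index/_settings/index.blocks.*?pretty'

If index.blocks.read_only (or write, or read_only_allow_delete) comes back as true, clear it with a _settings PUT like the ones above and fix the ISM policy so the block is not re-applied.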
I am trying to get acquainted with Amazon Big Data tools and I want to preprocess data from S3 for eventually using it for Machine Learning.
I am struggling to understand how to effectively read data into an AWS EMR Spark cluster.
I have a Scala script that takes a long time to run; most of that time is spent in Spark's explode+pivot on my data and then writing out the result with Spark-CSV.
But even reading the raw data files takes up too much time in my view.
Then I created a script that only reads in data with sqlContext.read.json() from 4 different folders (data sizes of 0.18 MB, 0.14 MB, 0.0003 MB and 399.9 MB respectively). I used System.currentTimeMillis() before and after each read call to see how much time it takes; with 4 different instance settings the results were the following:
Folder    | m1.medium (1) | m1.medium (4) | c4.xlarge (1) | c4.xlarge (4)
1. folder | 00:00:34.042  | 00:00:29.136  | 00:00:07.483  | 00:00:06.957
2. folder | 00:00:04.980  | 00:00:04.935  | 00:00:01.928  | 00:00:01.542
3. folder | 00:00:00.909  | 00:00:00.673  | 00:00:00.455  | 00:00:00.409
4. folder | 00:04:13.003  | 00:04:02.575  | 00:03:05.675  | 00:02:46.169
The number after the instance type indicates how many nodes were used: 1 is master only, and 4 is one master plus 3 slaves of the same type.
Firstly, it is weird that reading the first two similarly sized folders takes different amounts of time.
But still, how can it take several seconds to read in less than 1 MB of data?
I had 1800 MB of data a few days ago, and my data-processing job on c4.xlarge (4 nodes) ran for 1.5 h before it failed with this error:
controller log:
INFO waitProcessCompletion ended with exit code 137 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 4870 seconds
2016-07-01T11:50:38.920Z INFO Step created jobs:
2016-07-01T11:50:38.920Z WARN Step failed with exitCode 137 and took 4870 seconds
stderr log:
16/07/01 11:50:35 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 4 (MapPartitionsRDD[21] at json at DataPreProcessor.scala:435)
16/07/01 11:50:35 INFO TaskSchedulerImpl: Adding task set 4.0 with 24 tasks
16/07/01 11:50:36 WARN TaskSetManager: Stage 4 contains a task of very large size (64722 KB). The maximum recommended task size is 100 KB.
16/07/01 11:50:36 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 5330, localhost, partition 0,PROCESS_LOCAL, 66276000 bytes)
16/07/01 11:50:36 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 5331, localhost, partition 1,PROCESS_LOCAL, 66441997 bytes)
16/07/01 11:50:36 INFO Executor: Running task 0.0 in stage 4.0 (TID 5330)
16/07/01 11:50:36 INFO Executor: Running task 1.0 in stage 4.0 (TID 5331)
Command exiting with ret '137'
This data doubled in size over the weekend. So if I get ~1 GB of new data each day now (and it will grow fast soon), I will hit big-data sizes very quickly, and I really need an efficient way to read and process the data.
How can I do that? Is there anything I am missing? I can upgrade my instances, but it does not seem normal to me that reading 0.2 MB of data with 4x c4.xlarge (4 vCPU, 16 ECU, 7.5 GiB mem) instances takes 7 seconds (even with inferring the data schema automatically for ~200 JSON attributes).
I have created an ES domain to search the VPC Flow Logs and CloudTrail logs with daily indexing.
Right now, the status is RED:
{
"cluster_name": "678628912247:test",
"status": "red",
"timed_out": false,
"number_of_nodes": 17,
"number_of_data_nodes": 17,
"active_primary_shards": 687,
"active_shards": 1374,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 8,
"number_of_pending_tasks": 0
}
Investigating further, I found that one index is RED:
red open cwl-2016.02.19 5 1 381700 102899 335.8mb 167.9mb
Looking into the shards:
cwl-2016.02.19 2 p UNASSIGNED
cwl-2016.02.19 2 r UNASSIGNED
cwl-2016.02.19 0 p UNASSIGNED
cwl-2016.02.19 0 r UNASSIGNED
cwl-2016.02.19 3 p STARTED 381700 167.9mb x.x.x.x Elektra Natchios
cwl-2016.02.19 3 r STARTED 381700 167.9mb x.x.x.x Chronos
cwl-2016.02.19 1 p UNASSIGNED
cwl-2016.02.19 1 r UNASSIGNED
cwl-2016.02.19 4 p UNASSIGNED
cwl-2016.02.19 4 r UNASSIGNED
I tried to reroute the shards to less used nodes, but it gives me:
{"Message":"Your request: '/_cluster/reroute' is not allowed."}
Any advice on what I should do now?
Thanks & Regards.
A red cluster status means that at least one primary shard and its replicas are not allocated to a node.
Since you have already found the red index, the best option is to delete it.
If deletion is not possible, restore the index from a snapshot (note that AWS automatically takes snapshots).
As a last resort, you can contact AWS Support and they can restore it for you.
It's important to fix a red cluster, because once the cluster is red, AWS stops taking automatic snapshots.
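A sketch of that deletion for the index from the question (the endpoint is a placeholder, and deleting the index discards whatever data it still holds):

curl -XDELETE 'https://your-domain-endpoint/cwl-2016.02.19'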
A RED cluster means one or more primary shards are not available; that means data loss, and it is a very serious issue that requires an immediate fix.
If you have a snapshot, try to recover the index from it.
In the future, try increasing the replica count so that losing a primary shard is not fatal and the data can easily be recovered from the replica shards.
Check the ES cluster logs and try to find out why the primary shards went missing.
See if the reroute API can be useful; it will be if you have the shard available on disk but there is no data node where ES can allocate it. See if you can add a data node, or create a configuration that can recover the primary shards.
Regarding the error when trying to run the reroute API, it seems to be a permission issue, which you can solve by getting the proper access.
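As a rough sketch of the snapshot route (assumptions: cs-automated is the repository name AWS commonly uses for automated snapshots, the endpoint and SNAPSHOT_NAME are placeholders, and the existing red index has to be deleted or closed before it can be restored):

# list the snapshots available in the automated repository
curl -XGET 'https://your-domain-endpoint/_snapshot/cs-automated/_all?pretty'

# restore only the broken index from the chosen snapshot
curl -XPOST 'https://your-domain-endpoint/_snapshot/cs-automated/SNAPSHOT_NAME/_restore' \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "cwl-2016.02.19" }'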
Elasticsearch Allocation API
The cluster allocation explain API will help you understand the cluster's allocation issues:
curl -XGET "location:9200/_cluster/allocation/explain"
Resolve the issues or reasons reported by the allocation API, then re-initiate allocation with the following:
curl -X POST http://127.0.0.1:9200/_cluster/reroute?retry_failed=true