Memory allocation in YARN on Amazon EMR

I am getting this error in PySpark on Amazon EMR; my file is about 2 GB. What can I do to change the allocation?
Thanks
I tried increasing the size of the cluster, but at some stages I still have the problem:
Py4JJavaError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext. :
java.lang.IllegalArgumentException: Required executor memory (8192),
overhead (1536 MB), and PySpark memory (0 MB) is above the max
threshold (5760 MB) of this cluster! Please check the values of
'yarn.scheduler.maximum-allocation-mb' and/or
'yarn.nodemanager.resource.memory-mb'.

When you submit your job to Apache Spark, you can add parameters to customize the memory, as in the example below.
These parameters will override the default configuration.
Example
"--deploy-mode": "cluster",
"--num-executors": 60,
"--executor-memory": "16g",
"--executor-cores": 5,
"--driver-memory": "16g",
"--conf": {"spark.driver.maxResultSize": "2g"}

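For the specific error above, the per-executor request (8192 MB plus 1536 MB overhead) has to fit under yarn.scheduler.maximum-allocation-mb (5760 MB on that cluster), so another option is to shrink the request instead of growing the cluster. A minimal PySpark sketch, where the 4g/1g figures are only illustrative assumptions:
from pyspark.sql import SparkSession

# Keep executor memory + overhead under yarn.scheduler.maximum-allocation-mb
# (5760 MB in the error above); the values below are examples only
spark = (
    SparkSession.builder
    .appName("emr-memory-example")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
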
Related

aws lambda: Error: Runtime exited with error: signal: killed

I'm trying to pull a large file from S3 and write it to RDS using pandas dataframes.
I've been googling this error and haven't seen it anywhere. Does anyone know what this extremely generic-sounding error could mean? I've encountered memory issues previously, but increasing the memory removed that error.
{
"errorType": "Runtime.ExitError",
"errorMessage": "RequestId: 99aa9711-ca93-4201-8b1e-73bf31b762a6 Error: Runtime exited with error: signal: killed"
}
I got the same error when executing a Lambda that processes an image; only a few results come up when searching the web for this error.
Increase the AWS Lambda memory by 1.5x or 2x to resolve it, for example from 128 MB to 512 MB.
When this runtime error occurs, the Lambda function does not execute the remaining lines of code; moreover, it is not possible to catch the error and run the rest of the code.
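If you prefer to change the memory setting programmatically rather than in the console, a small boto3 sketch (the function name and the 512 MB value are placeholders):
import boto3

# Raise the Lambda function's memory limit; name and size are example values only
lambda_client = boto3.client('lambda')
lambda_client.update_function_configuration(
    FunctionName='my-function',
    MemorySize=512,  # in MB
)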
You're hitting the memory limit because boto3 transfers your file in parallel.
You could increase the Lambda's memory, but that's cheating... you'll just pay more.
By default, the S3 transfer layer downloads files larger than multipart_threshold=8MB using max_concurrency=10 parallel threads. That means it will use 80 MB for your data, plus threading overhead.
You could reduce it to max_concurrency=2, for example; that would use 16 MB and should fit within your Lambda's memory.
Please note that this may slightly decrease your download performance.
import boto3
from boto3.s3.transfer import TransferConfig

# Limit parallel transfer threads to keep peak memory usage low
config = TransferConfig(max_concurrency=2)

s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)
Reference: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
It is not timing out at 15 minutes, since that would log this error "Task timed out after 901.02 seconds", which the OP did not get. As others have said, he is running out of memory.
First of all, AWS Lambda is not meant for long-running, heavy operations like pulling large files from S3 and writing them to RDS.
This process can take too much time depending on the file size and data. The maximum execution time of AWS Lambda is 15 minutes, so whatever task you are doing in your Lambda must complete within the time limit you configured (the maximum is 15 minutes).
With large and heavy processing in Lambda you can get out-of-memory errors, timeouts, or sometimes you simply need more processing power.
Another way of doing such large and heavy processing is AWS Glue jobs, which is AWS's managed ETL service.
The solution is to increase the AWS Lambda memory by 1.5x or 2x,
because when this runtime error occurs, the Lambda function does not execute any other line of code; it is not possible to catch the error and run the rest of the code.
This error acts as the signal to the Lambda execution environment to terminate the current execution.
To add, if anyone is using AWS Amplify, as in the project I was working on: there are still Lambdas under the hood, and you can access and configure them directly from the AWS Lambda console.

BigQuery unable to insert job. Workflow failed

I need to run a batch job from GCS to BigQuery via Dataflow and Beam. All my files are Avro with the same schema.
I've created a Dataflow Java application that is successful on a smaller set of data (~1 GB, about 5 files).
But when I try to run it on a bigger set of data (>500 GB, >1000 files), I receive an error message:
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix 1b83679a4f5d48c5b45ff20b2b822728_6e48345728d4da6cb51353f0dc550c1b_00001_00000, reached max retries: 3, last failed load job: ...
After 3 retries it terminates with:
Workflow failed. Causes: S57....... A work item was attempted 4 times without success....
This step is the load to BigQuery.
Stackdriver says the processing is stuck in step .... for 10m00s ... and
Request failed with code 409, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes.....
I looked up the 409 error code, which states that I might have an existing job, dataset, or table. I've removed all the tables and re-run the application, but it still shows the same error message.
I am currently limited to 65 workers, using n1-standard-4 machines.
I believe there are other ways to move the data from GCS to BigQuery, but I need to demonstrate Dataflow.
"java.lang.RuntimeException: Failed to create job with prefix beam_load_csvtobigqueryxxxxxxxxxxxxxx, reached max retries: 3, last failed job: null.
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:198)..... "
One possible cause could be a privilege issue. Ensure the user account that interacts with BigQuery has the "bigquery.jobs.create" permission, which is included in the predefined role "BigQuery User".
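A quick way to verify that the credential your pipeline uses can create jobs is a dry-run query, which requires bigquery.jobs.create. A small sketch with the google-cloud-bigquery client, where the project id is a placeholder:
from google.cloud import bigquery

# A dry-run query raises a 403 error if the credential lacks bigquery.jobs.create
client = bigquery.Client(project="my-project")
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
client.query("SELECT 1", job_config=job_config)
print("bigquery.jobs.create is available for this credential")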
Posting the comment of @DeaconDesperado as community wiki: they experienced the same error, removed the characters not allowed in the table name (BigQuery table names may contain only Unicode letters, marks, numbers, connectors, dashes, or spaces), and the error was gone.
I got the same problem using "roles/bigquery.jobUser", "roles/bigquery.dataViewer", and "roles/bigquery.user". But only when granting "roles/bigquery.admin" did the issue get resolved.

Experienced problems with our RDS instance

We experienced problems with our RDS instance.
RDS stops running. The RDS instance shows as "green" on the AWS console, but we cannot connect to it.
In the logs we found the following errors:
2018-03-07 8:52:31 47886953160896 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: Set innodb_force_recovery to ignore this error.
2018-03-07 8:52:32 47886953160896 [ERROR] Plugin 'InnoDB' init function returned error.
2018-03-07 8:53:46 47508779897024 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: Set innodb_force_recovery to ignore this error.
2018-03-07 8:53:46 47508779897024 [ERROR] Plugin 'InnoDB' init function returned error.
When we tried to reboot the RDS instance, it took almost 2 hours to reboot. After rebooting, it is working fine again!
Can someone help us find the root cause of this incident?
A t2.small provides 2 GB of RAM. As you might know, most DB engines tend to use up to 75% of the memory for caching purposes such as queries, temporary tables, and table scans to make things go faster.
For the MariaDB engine, the following parameters are set to these pre-optimized values by default:
innodb_buffer_pool_size (3/4 of the instance RAM = 1.5 GB)
key_buffer_size (16777216 bytes = 16 MB)
innodb_log_buffer_size (8388608 bytes = 8 MB)
Apart from that, the OS and the RDS processes also use some amount of RAM for their own operations. To summarize, around 1.6 GB is used by the DB engine, and the actual usable memory left after subtracting innodb_buffer_pool_size, key_buffer_size, and innodb_log_buffer_size is around 400 MB.
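To make the arithmetic explicit, here is a rough back-of-the-envelope budget for a 2 GB t2.small with the default values above (all figures approximate):
# Approximate memory budget for a t2.small (2048 MB) MariaDB RDS instance
total_ram_mb = 2048
innodb_buffer_pool_mb = total_ram_mb * 3 // 4   # 1536 MB
key_buffer_mb = 16
innodb_log_buffer_mb = 8

remaining_mb = total_ram_mb - innodb_buffer_pool_mb - key_buffer_mb - innodb_log_buffer_mb
print(remaining_mb)  # ~488 MB, from which the OS and RDS processes take their share,
                     # leaving roughly 400 MB of actually usable memory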
Overall, your FreeableMemory decreased to as low as ~137 MB. As a result, SwapUsage increased drastically in the same time period, to approximately 152 MB.
FreeableMemory was quite low and there was high swap utilization. Further, due to the memory pressure (insufficient memory and high swap usage), the RDS internal monitoring system was not able to communicate with the host, which in turn resulted in an underlying host replacement.

AWS elastic-search. FORBIDDEN/8/index write (api). Unable to write to index

I am trying to dump a list of docs to an AWS Elasticsearch instance. It was running fine; then, all of a sudden, it started throwing this error:
{ _index: '<my index name>',
_type: 'type',
_id: 'record id',
status: 403,
error:
{ type: 'cluster_block_exception',
reason: 'blocked by: [FORBIDDEN/8/index write (api)];' } }
I checked in forums. Most of them say that it is a JVM memory issue: if it goes above 92%, AWS will stop any writes to the cluster/index. However, when I checked the JVM memory, it shows less than 92%. Am I missing something here?
This error means the Amazon ES service is actively blocking writes to protect the cluster from reaching red or yellow status. It does this using index.blocks.write.
The two reasons are:
Low Memory
When the JVMMemoryPressure metric exceeds 92% for 30 minutes, Amazon ES triggers a protection mechanism and blocks all write operations to prevent the cluster from reaching red status. When the protection is on, write operations fail with a ClusterBlockException error, new indexes can't be created, and the IndexCreateBlockException error is thrown.
When the JVMMemoryPressure metric returns to 88% or lower for five minutes, the protection is disabled, and write operations to the cluster are unblocked.
Low Disk Space
Elasticsearch has a default "low watermark" of 85%, meaning that once disk usage exceeds 85%, Elasticsearch no longer allocates shards to that node. Elasticsearch also has a default "high watermark" of 90%, at which point it attempts to relocate shards to other nodes.
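To see which of the two limits you are hitting, you can pull the relevant CloudWatch metrics. A sketch using boto3, where the domain name and account id are placeholders and the metric names assume the standard AWS/ES namespace:
import boto3
from datetime import datetime, timedelta

# Check recent JVM memory pressure; values near 92% explain the write block.
# FreeStorageSpace can be queried the same way to check the disk-space case.
cloudwatch = boto3.client('cloudwatch')
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/ES',
    MetricName='JVMMemoryPressure',
    Dimensions=[
        {'Name': 'DomainName', 'Value': 'my-domain'},   # placeholder
        {'Name': 'ClientId', 'Value': '123456789012'},  # placeholder account id
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Maximum'],
)
print(stats['Datapoints'])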
This error indicates that AWS Elasticsearch has placed a block on your domain based upon disk space. At 85%, ES will not allow you to create any new indexes. At 90%, no new documents can be written.
ES can apply a write block on an index during rollovers, or due to low disk space or memory.
In order to stop these errors, you need to remove the write block on the index by setting index.blocks.write to false:
curl -X PUT -H "Content-Type: application/json" \
'http://localhost:9200/{index_name}/_settings' \
-d '{ "index": { "blocks": { "write": "false" } } }'
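The same settings update from Python, for anyone scripting it; the endpoint and index name are placeholders, and you should add whatever authentication your domain requires:
import requests

# Clear the write block on the index; equivalent to the curl call above
endpoint = "https://my-domain.region.es.amazonaws.com"  # placeholder AWS ES endpoint
resp = requests.put(
    endpoint + "/my_index/_settings",
    json={"index": {"blocks": {"write": False}}},
)
resp.raise_for_status()
print(resp.json())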
The accepted solution was not enough in my case; I had to remove index.blocks.read_only_allow_delete as well:
PUT /my_index/_settings
{
"index.blocks.read_only_allow_delete": null,
"index.blocks.write": null
}
ES version 7.15
This can also happen if the index you're trying to write to has been marked as read-only. I've had it happen due to an Index State Management misconfiguration that caused a weekly index to be moved to a warm state after one day.

Cannot do simple task on ec2 spark cluster from local pyspark

I am trying to execute PySpark from my Mac to run computations on an EC2 Spark cluster.
If I log in to the cluster, it works as expected:
$ ec2/spark-ec2 -i ~/.ec2/spark.pem -k spark login test-cluster2
$ spark/bin/pyspark
Then do a simple task
>>> data=sc.parallelize(range(1000),10)
>>> data.count()
Works as expected:
14/06/26 16:38:52 INFO spark.SparkContext: Starting job: count at <stdin>:1
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 10 output partitions (allowLocal=false)
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
...
14/06/26 16:38:53 INFO spark.SparkContext: Job finished: count at <stdin>:1, took 1.195232619 s
1000
But now, if I try the same thing from my local machine,
$ MASTER=spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 bin/pyspark
it can't seem to connect to the cluster:
14/06/26 09:45:43 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077...
14/06/26 09:45:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...
File "/Users/anthony1/git/incubator-spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.collect.
: org.apache.spark.SparkException: Job aborted: Spark cluster looks down
14/06/26 09:53:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I thought the problem was with the EC2 security groups, but adding inbound rules that accept all ports to both the master and slave security groups did not help.
Any help will be greatly appreciated!
Others are asking the same question on the mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-td4758.html#a8465
The spark-ec2 script configures the Spark cluster in EC2 as standalone, which means it cannot work with remote submits. I struggled with this same error you describe for days before figuring out it's not supported. The error message is unfortunately misleading.
So you have to copy your code to the master and log in there to execute your Spark task.
In my experience, "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory" usually means you have accidentally set the cores too high, or set the executor memory too high, i.e. higher than what your nodes actually have (see the sketch below).
Other, less likely causes could be that you got the URI wrong and you're not really connecting to the master. I also once saw this problem when the /run partition was 100% full.
Even less likely, your cluster may actually be down and you need to restart your Spark workers.
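For the "cores or executor memory set too high" case, a minimal PySpark sketch that caps both explicitly when building the context; the 4-core / 2g figures are only examples and should be at or below what your workers actually have:
from pyspark import SparkConf, SparkContext

# Cap the job's resource request so the master can actually satisfy it
conf = (
    SparkConf()
    .setAppName("resource-cap-example")
    .setMaster("spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077")  # cluster URL from the question
    .set("spark.cores.max", "4")         # example value
    .set("spark.executor.memory", "2g")  # example value
)
sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000), 10).count())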