Why can't HDFS use the remaining good data-nodes in its pipeline recovery process? - hdfs

Question
Why can't HDFS use the remaining good data-nodes in its pipeline recovery process?
Setup
We have 5 Data Nodes in our HDFS cluster.
We have a replication factor of 3.
We have set dfs.client.block.write.replace-datanode-on-failure.policy to DEFAULT
One of the Data Nodes is taken down when a write is in progress.
There are 4 remaining Data Nodes in our HDFS cluster.
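For reference, here is a sketch of the relevant client-side properties as we understand them, in hdfs-site.xml form (the enable and best-effort values shown are the documented defaults, not something we have changed):
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>false</value>
</property>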
Exception
We see an ERROR log statement with the following exception
> java.io.IOException: Failed to replace a bad datanode on the existing
> pipeline due to no more good datanodes being available to try. (Nodes:
> current=[DatanodeInfoWithStorage[xxx.xxx.xxx.65:xxxx,DS-0ea7ee77-34e2-41e1-9343-50c373fe08e1,DISK],
> DatanodeInfoWithStorage[xxx.xxx.xxx.61:xxxx,DS-894ff0a4-9906-499a-919d-4af136208502,DISK]],
> original=[DatanodeInfoWithStorage[xxx.xxx.xxx.65:xxxx,DS-0ea7ee77-34e2-41e1-9343-50c373fe08e1,DISK],
> DatanodeInfoWithStorage[xxx.xxx.xxx.61:xxxx,DS-894ff0a4-9906-499a-919d-4af136208502,DISK]]).
> The current failed datanode replacement policy is DEFAULT, and a
> client may configure this via
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its
> configuration.
> at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
> at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
> at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
> at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
> at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
I see no exceptions in the name node at this time.
Please do not respond by asking me to change dfs.client.block.write.replace-datanode-on-failure.policy unless it will force HDFS to use the remaining good nodes.
I would like to understand why my observations do not match the expected behaviour of HDFS.
I have looked at the code, and it looks like the ClientProtocol implementation is sending back the set of nodes that are currently being used in the pipeline.
The DataStreamer class throws this exception because the set of nodes currently being used in the pipeline has the same length as the suggested set of nodes.

Related

Rebalance Akka Cluster if One Of Shard Is not Resolving

Intermittently we are receiving the following errors:
2022-05-25 08:32:30,691 ERROR app=abc a.c.s.DDataShardCoordinator - The ShardCoordinator was unable to update a distributed state within ‘updating-state-timeout’: 2000 millis (retrying). Perhaps the ShardRegion has not started on all active nodes yet? event=ShardRegionRegistered(Actor[akka://application#10.52.174.4:25520/system/sharding/abcapp#-1665332307])
2022-05-25 08:32:31,348 WARN app=abc a.c.s.ShardRegion - abcapp: Trying to register to coordinator at [ActorSelection[Anchor(akka://application#10.52.103.132:25520/), Path(/system/sharding/abcappCoordinator/singleton/coordinator)]], but no acknowledgement. Total [22] buffered messages. [Coordinator [Member(address = akka://application#10.52.103.132:25520, status = Up)] is reachable.]
When we check the cluster members using /cluster/members, we get "10.52.174.4:25520" reported as:
{
  "node": "akka://application#10.52.252.4:25520",
  "nodeUid": "7353086881718190138",
  "roles": [
    "dc-default"
  ],
  "status": "Up"
},
This says it is healthy, but the problem resolves when we remove this node from the cluster using /cluster/members/{address} (a leave operation to remove 10.52.252.4 from the cluster; once it is removed, the cluster creates a new pod and rebalances).
We need help understanding the best way of handling this error.
Thanks
You can of course implement an external control plane to parse logs and take a node exhibiting this error out of the cluster.
That said, it's better to understand what's happening here. The ShardCoordinator runs on the oldest node in the cluster, and needs to ensure that there's agreement on things like which nodes own which shards. It accomplishes this by requiring that updates be acknowledged by a majority of nodes in the cluster. If a state update isn't acknowledged, then further updates to the state (e.g. rebalances) are delayed.
I said "majority", but because in clusters where there's substantial node turnover relative to the size of the cluster simple majorities can lead to data loss, it becomes more complex. Consider a cluster of 3 nodes, N1, N2, N3. N1 (the ShardCoordinator) updates state and considers it successful when it and N3 have updated state. N1 is dropped from the cluster and replaced by N4; N2 becomes the shard coordinator (being the next oldest node) and requests state from itself and the other nodes; N4 responds first. The result becomes that the state update N1 made is lost. So two other settings come into play:
akka.cluster.sharding.coordinator-state.write-majority-plus (default 3), which adds that number to the majority write requirement (rounding down)
akka.cluster.distributed-data.majority-min-cap (default 5), which requires that the majority plus the added nodes be at least this value
If the computed majority is greater than the number of nodes, the majority becomes all nodes. So in a cluster with fewer than 9 nodes with the defaults these become effectively all nodes (and the actual timeout when updating is a quarter of the configured timeout, to allow for three retries).
You don't say what your cluster size is, but if running in a cluster with fewer than 9 nodes, it can be a good idea to increase the akka.cluster.sharding.updating-state-timeout from the default 5 seconds to allow for the increased consistency level. Decreasing write-majority-plus and majority-min-cap can be an option, if you're willing to take the risks of violating cluster sharding's guarantees (e.g. multiple instances of the same entity running and potentially destroying their persistent state). Increasing the cluster size can also be helpful, paradoxically, if the reason other nodes are slow to respond is overload.
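For concreteness, a minimal application.conf sketch of the knobs discussed above (the 10s timeout is purely illustrative; verify exact paths and defaults against your Akka version's reference.conf):
akka.cluster.sharding {
  # allow more time for the coordinator's replicated state writes (default 5s)
  updating-state-timeout = 10s

  # lowering these relaxes the consistency level described above at the cost
  # of weakening cluster sharding's guarantees; the defaults are shown
  coordinator-state.write-majority-plus = 3
  distributed-data.majority-min-cap = 5
}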

Collect one cell from pyspark Dataframe failed [duplicate]

I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.
17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:726)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:755)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /XXX.XX.XXX.XX:36245 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
The reason for adding this configuration was the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o171.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Therefore, I increased maxResultSize to 2.5 Gb, but the Spark job fails anyway (the error shown above).
How to solve this issue?
It seems like the problem is that the amount of data you are trying to pull back to your driver is too large. Most likely you are using the collect method to retrieve all values from a DataFrame/RDD. The driver is a single process, and by collecting a DataFrame you are pulling all of the data you had distributed across the cluster back to one node. This defeats the purpose of distributing it! It only makes sense to do this after you have reduced the data down to a manageable amount.
You have two options:
If you really need to work with all that data, then you should keep it out on the executors. Use HDFS and Parquet to save the data in a distributed manner and use Spark methods to work with the data on the cluster instead of trying to collect it all back to one place.
If you really need to get the data back to the driver, you should examine whether you really need ALL of the data or not. If you only need summary statistics then compute that out on the executors before calling collect. Or if you only need the top 100 results, then only collect the top 100.
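A minimal PySpark sketch of both options (the DataFrame df, the column names, and the output path are assumptions for illustration):
# Option 1: keep the data distributed and write it out instead of collecting it
df.write.mode("overwrite").parquet("hdfs:///tmp/results")

# Option 2: reduce before bringing anything back to the driver
summary = df.groupBy("key").count().collect()                  # small aggregate only
top_100 = df.orderBy(df["score"].desc()).limit(100).collect()  # just the top 100 rows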
Update:
There is another reason you can run into this error that is less obvious. Spark will try to send data back to the driver beyond just when you explicitly call collect. It will also send back accumulator results for each task if you are using accumulators, data for broadcast joins, and some small status data about each task. If you have LOTS of partitions (20k+ in my experience) you can sometimes see this error. This is a known issue with some improvements made, and more in the works.
The options for getting past this, if it is your issue, are:
Increase spark.driver.maxResultSize or set it to 0 for unlimited
If broadcast joins are the culprit, you can reduce spark.sql.autoBroadcastJoinThreshold to limit the size of broadcast join data
Reduce the number of partitions
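For reference, a sketch of the first two options as spark-submit flags (the values and the script name are illustrative):
spark-submit \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.sql.autoBroadcastJoinThreshold=1048576 \
  your_job.py
Here 0 removes the result-size limit entirely, and 1048576 caps broadcast join data at 1 MB (-1 disables broadcast joins altogether).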
Cause: actions like RDD's collect() that send a big chunk of data to the driver.
Solution:
set by SparkConf: conf.set("spark.driver.maxResultSize", "4g")
OR
set by spark-defaults.conf: spark.driver.maxResultSize 4g
OR
set when calling spark-submit: --conf spark.driver.maxResultSize=4g

Create cron in chef

I want to create a cron job in Chef that checks the size of a log file; if it is larger than 30 MB, it deletes it. Here is my code:
cron_d 'ganglia_tomcat_thread_max' do
  hour '0'
  minute '1'
  command "rm -f /srv/node/current/app/log/simplesamlphp.log"
  only_if { ::File.size('/srv/node/current/app/log/simplesamlphp.log').to_f / 1024000 > 30 }
end
Can you help me with this, please?
Welcome to Stack Overflow!
I suggest you go with an existing tool like logrotate. There is a Chef cookbook available to manage logrotate.
Please note that cron in Chef manages the system cron service, which runs independently of Chef, so you'll have to do the file size check within the command itself. It's also better to use the cron_d resource, as documented here.
The way you have written the cron_d resource, the cron task will be added only if your log file is larger than 30 MB at the time Chef runs; in all other cases the cron_d entry will not be created.
You can use this Ruby code
File.size('file').to_f / 2**20
to get the file size in megabytes. There is a slight difference in the result (2**20 is 1048576, not 1024000), which I believe is more correct.
So you can go with two solutions for your specific case:
Create a second cron_d resource that removes the existing cron entry when the log file is smaller than 30 MB, and provision your node periodically.
Move the file-size check into the command itself with bash, glued together with &&; that way the file is deleted only if its size is greater than 30 MB, something like the sketch below. Note that
du -k file.txt | cut -f1
will return the size of the file in kilobytes (1 KB blocks), not bytes.
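Putting that together, a sketch of the second option (the resource name is illustrative, and 30720 is 30 MB expressed in the 1 KB units that du -k reports):
cron_d 'cleanup_simplesamlphp_log' do
  hour '0'
  minute '1'
  # the size test now runs every night when cron fires, not only when chef-client converges
  command "[ $(du -k /srv/node/current/app/log/simplesamlphp.log | cut -f1) -gt 30720 ] && rm -f /srv/node/current/app/log/simplesamlphp.log"
end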
To me, the correct way to do this is also to use the logrotate service and the Chef cookbook for it.

Aerospike error: All batch queues are full

I am running an Aerospike cluster in Google Cloud. Following the recommendation in this post, I updated to the latest version (3.11.1.1) and re-created all servers. In fact, this change caused my 5 servers to operate at a much lower CPU load (it was around 75% load before; now it is at 20%, as shown in the graph below):
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, recommending to change the parameters batch-index-threads and batch-max-unused-buffers with the command
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many combinations of values for batch-index-threads (2, 4, 8, 16), and also changed the batch-max-unused-buffers param, but nothing solved the problem. I keep receiving the All batch queues are full error.
Here is the relevant information from my aerospike.conf:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-recovery-policy auto-reset-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 32
    transaction-queues 32
    transaction-threads-per-queue 4
    batch-index-threads 40
    proto-fd-max 15000
    batch-max-requests 30000
    replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.
A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
It's possible that you are being constrained by the network between the clients and servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node will start cluster migrations which adds additional network load.
I would look at the batch-max-buffer-per-queue config parameter.
> Maximum number of 128KB response buffers allowed in each batch index queue. If all batch index queues are full, new batch requests are rejected.
In conjunction with raising this value from the default of 255 you will want to also raise the batch-max-unused-buffers to batch-index-threads x batch-max-buffer-per-queue + 1 (at least). If you do not do that new buffers will be created and destroyed constantly, as the amount of free (unused) buffers is smaller than the ones you're using. The moment the batch response is served the system will strive to trim the buffers down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example if you raise the batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance the batch-max-unused-buffers should be set to 13000 which will have a max memory consumption of 1625MB (1.59GB) per-node.
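A sketch of applying those two settings dynamically, reusing the asadm/asinfo form from the question (this assumes both parameters are dynamically settable on your server version; mirror them in aerospike.conf so they survive a restart):
asadm -e "asinfo -v 'set-config:context=service;batch-max-buffer-per-queue=320'"
asadm -e "asinfo -v 'set-config:context=service;batch-max-unused-buffers=13000'"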

AWS EMR Parallel Mappers?

I am trying to determine how many nodes I need for my EMR cluster. As part of best practices the recommendations are:
(Total Mappers needed for your job + Time taken to process) / (per instance capacity + desired time) as outlined here: http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013, page 89.
The question is how to determine how many parallel mappers each instance will support, since AWS doesn't publish this: https://aws.amazon.com/emr/pricing/
Sorry if I missed something obvious.
Wayne
To determine the number of parallel mappers, you will need to check the EMR documentation called Task Configuration, where EMR has a predefined set of configurations for every instance type that determines the number of mappers/reducers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
For example: let's say you have 5 m1.xlarge core nodes. According to the default mapred-site.xml configuration values for that instance type from the EMR docs, we have
mapreduce.map.memory.mb = 768
yarn.nodemanager.resource.memory-mb = 12288
yarn.scheduler.maximum-allocation-mb = 12288 (same as above)
You can simply divide the latter by the former setting to get the maximum number of mappers supported by one m1.xlarge node: 12288 / 768 = 16.
So, for the 5-node cluster, that is a max of 16 * 5 = 80 mappers that can run in parallel (considering a map-only job). The same is the case with max parallel reducers (30). You can do similar math for a combination of mappers and reducers.
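A tiny sketch of that arithmetic, using the m1.xlarge defaults quoted above:
# rough parallel-mapper estimate under the default capacity scheduler
container_mb = 768         # mapreduce.map.memory.mb
node_mb = 12288            # yarn.nodemanager.resource.memory-mb
core_nodes = 5

mappers_per_node = node_mb // container_mb       # 16
cluster_mappers = mappers_per_node * core_nodes  # 80
print(mappers_per_node, cluster_mappers)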
So, if you want to run more mappers in parallel, you can either re-size the cluster or reduce mapreduce.map.memory.mb (and its heap, mapreduce.map.java.opts) on every node and restart the NodeManager for the change to take effect.
To understand what the above mapred-site.xml properties mean and why you need to do these calculations, you can refer to:
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Note: the above calculations and statements are true if EMR stays in its default configuration, using the YARN capacity scheduler with DefaultResourceCalculator. If, for example, you configure your capacity scheduler to use DominantResourceCalculator, it will consider vCPUs + memory on every node (not just memory) to decide the number of parallel mappers.