Why does the datanode report block information to the namenode in HDFS? - hdfs

When writing a file, the namenode determines which datanodes each block of the file is to be written to. At this time, the namenode maintains the mapping of files to blocks and blocks to datanodes.
Why do datanodes still need to report block information (the mapping of blocks to datanodes)?

Related

Could an HDFS read/write process be suspended/resumed?

I have one question regarding the HDFS read/write process:
Assuming that we have a client (for the sake of the example, let's say the client is a Hadoop map process) that requests to read a file from HDFS and/or to write a file to HDFS, which process actually does the read/write from/to HDFS?
I know that there is a process for the Namenode and a process for each Datanode, and what their responsibilities to the system are in general, but I am confused in this scenario.
Is it the client's process by itself, or is there another process in HDFS, created and dedicated to this specific client, that accesses and reads/writes from/to HDFS?
Finally, if the second answer is true, is there any possibility that this process can be suspended for a while?
I have done some research, and the most relevant solutions I found were Oozie and the JobControl class from the Hadoop API.
But, because I am not sure about the above workflow, I am not sure which process I am suspending and resuming with these tools.
Is it the client's process, or a process running in HDFS to serve the client's request?
Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, the above posts explain datanode failure scenarios.
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
A single datanode failure triggers corrective actions by the framework.
Regarding your second query:
You have two types of schedulers :
FairScheduler
CapacityScheduler
Have a look at this article on suspend and resume
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs.
So as far as I understand, a Datanode process receives the data from the client's process (which requests to store some data in HDFS) and stores it. Then this Datanode forwards the exact same data to another Datanode (to achieve replication) and so on. When the replication finishes, an acknowledgement goes back to the Namenode, which finally informs the client about the completion of its write request.
Based on the above flow, it is impossible to suspend an HDFS write operation in order to serve a second client's write request (let's assume that the second client has higher priority), because if we suspend the Datanode itself, it will remain suspended for everyone who wants to write to it, and as a result this part of HDFS will remain blocked. Finally, if I suspend a job via the JobControl class functions, I actually suspend the client's process (if I actually manage to catch it before its request is done). Please correct me if I am wrong.
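To make that flow concrete, here is a minimal client-side write sketch using the standard Hadoop FileSystem API (the namenode address and path are made up, not from the question): the client process opens the stream and drives the write, while the DFSClient underneath manages the DataNode pipeline and waits for the acknowledgements when the stream is closed.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally come from core-site.xml; this URI is made up.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode for block allocations; the data itself
             // flows from this client process through the DataNode pipeline.
             FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } // close() waits for the pipeline acknowledgements before returning
    }
}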

How does Google file system deal with write failures at replicas?

I was reading the Google file system paper and am not sure how it deals with write (not atomic record append) failures at replicas. If it returns success, then it will wait until the next heartbeat for the master to get the updated state of the world, detect the corruption/stale chunk version, and delete the chunk. I guess that the master can verify the validity of all the replicas whenever clients ask for replica locations, to prevent the client from ever getting stale/corrupted data. Is this how it deals with replica write failure?
Invalid chunks fall into two categories: stale chunks and chunks with corrupted checksums. They are detected in two different ways.
Stale chunk. The chunk's version is not up to date. The master detects stale chunks during the regular heartbeat with the chunk servers.
Corrupted checksum. Since divergent replicas may be legal, replicas across GFS are not guaranteed to be identical. In addition, for performance reasons, checksumming is performed independently on the chunk servers themselves rather than on the master.
Checksums are verified at two points:
When clients or other chunk servers request the chunk
During idle periods, chunk servers scan and verify inactive chunks so that corrupted chunks are not counted as valid replicas.
If the checksum is corrupted, the chunk server reports the problem to the master. The master clones the replica from another chunk server holding a healthy replica. After that, the master instructs the chunk server that reported the problem to delete the corrupted chunk.
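GFS itself is not open source, so the following Java sketch is only an illustration of the verification step described above, with invented names: a chunk server checks a stored checksum for a 64 KB block before serving it and refuses the read (and would report to the master) on a mismatch.

import java.util.zip.CRC32;

// Illustrative only: verifies a 64 KB block of a chunk against a stored checksum,
// roughly mirroring the per-block checksumming described in the GFS paper.
class ChunkChecksumSketch {
    static final int BLOCK_SIZE = 64 * 1024;

    static boolean blockIsValid(byte[] blockData, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(blockData, 0, blockData.length);
        return crc.getValue() == storedChecksum;
    }

    // On a mismatch a real chunk server would return an error to the requester and
    // report the corruption to the master, which re-clones the replica elsewhere.
    static void serveBlock(byte[] blockData, long storedChecksum) {
        if (!blockIsValid(blockData, storedChecksum)) {
            throw new IllegalStateException("checksum mismatch: refuse the read and report to master");
        }
        // ... send blockData to the client ...
    }
}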
Back to your question: how does GFS deal with replica write failure?
If any error is encountered during replication, the failure of the mutation is reported to the client. The client must handle the error and retry the mutation. The inconsistent chunks will be garbage collected during the regular scans on the chunk servers.
Your question is not very clear, but I'll answer based on an educated guess of what it is. It seems related to how the chunk version number helps in stale replica detection:
A client asks to write to a file.
The master sends a metadata response to the client. That metadata response also includes a chunk version number. This indicates that a lease on a chunkserver has been granted to the client.
Now, the master increments the chunk version number and asks all the chunkservers (the chunk replicas) to do the same once the WRITE is COMPLETED.
All of this happens before the write to the chunk actually begins.
Say a chunkserver crashes.
Once the chunkserver restarts, the master and chunkserver communicate via heartbeat messages and compare the "chunk version number". The point is, if the write completed, the chunk version number of every replica should be the same as the master's chunk version number. If it is not the same, a failure occurred during writing.
So the master decrements its chunk version number, and during garbage collection all those failed replicas are removed.
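As a small illustration of the version comparison described above (invented names, not real GFS code): during the heartbeat the master compares the version a chunkserver reports for a chunk against its own record, and anything older is treated as stale.

import java.util.Map;

// Illustrative sketch of stale-replica detection via chunk version numbers.
class StaleReplicaCheckSketch {
    // masterVersions: the master's view, chunk handle -> current version number
    static boolean isStale(Map<Long, Integer> masterVersions,
                           long chunkHandle, int reportedVersion) {
        // A replica reported during a heartbeat with a version lower than the
        // master's missed one or more mutations; it is marked stale and will be
        // garbage collected rather than handed out to clients.
        Integer current = masterVersions.get(chunkHandle);
        return current != null && reportedVersion < current;
    }
}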

Hadoop data structure to save block report in namenode

Hadoop saves files in the form of blocks on data nodes (DN), and metadata about files is saved in the namenode (NN). Whenever a file is read by a client, the NN sends a read pipeline (a list of DNs) from which the file blocks are to be picked up. The read pipeline consists of the DNs nearest to the client, to serve the read request.
I am curious to know how the NN keeps the DN information for the blocks of a file. I mean the data structure. Is it a graph with the locations of all replicas across DNs? And later on, while creating a read pipeline, is some algorithm used to find the shortest path between the client and the corresponding DNs?
There is not really a term "read pipeline". For read/write requests, the granularity is block level. Generally the write pipeline looks like: client -> DN1 -> DN2 -> DN3. For a read, the client contacts the NN for the list of DNs that hold replicas of the block to read. Then the client reads the data from the nearest DN directly, without involving the other DNs (in case of error, the client may try another DN in the list).
As to the "data structure" for the block-to-DN information, the NN maintains an in-memory block -> DNs mapping. Basically the mapping is a map, not a graph. To keep the map up to date, each DN periodically reports its local block replicas to the NN.
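As a rough mental model (not the actual NameNode code, which keeps this far more compactly inside its BlockManager), the mapping and the block-report update could be pictured like the following Java sketch; all names here are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified picture of the NameNode's in-memory block map.
class BlockMapSketch {
    // block ID -> list of DataNode IDs currently holding a replica
    private final Map<Long, List<String>> blockToDatanodes = new HashMap<>();

    // Called when a DataNode's periodic (or incremental) block report arrives;
    // the NameNode updates the mapping from it. Duplicate handling is omitted.
    void onBlockReport(String datanodeId, List<Long> reportedBlockIds) {
        for (long blockId : reportedBlockIds) {
            blockToDatanodes
                .computeIfAbsent(blockId, k -> new ArrayList<>())
                .add(datanodeId);
        }
    }

    // Answering a client's "where can I read this block?" question.
    List<String> locationsOf(long blockId) {
        return blockToDatanodes.getOrDefault(blockId, List.of());
    }
}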
The client is free to choose the nearest DN for the read. For this, HDFS has to be topology-aware. From the HDFS architecture doc:
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
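The "closest replica" choice is based on a network distance computed from each node's topology path (HDFS does this with its NetworkTopology class). The sketch below is only a rough approximation of that idea, with made-up topology paths of the form /datacenter/rack/host, not the actual HDFS implementation.

import java.util.Comparator;
import java.util.List;

// Rough sketch of "closest replica first" selection based on topology paths.
class ReplicaChoiceSketch {
    // distance 0 = same node, 2 = same rack, 4 = same data center, 6 = remote
    static int distance(String readerPath, String replicaPath) {
        if (readerPath.equals(replicaPath)) return 0;
        String[] a = readerPath.split("/"), b = replicaPath.split("/");
        // a[1] = datacenter, a[2] = rack, a[3] = host
        if (a[1].equals(b[1]) && a[2].equals(b[2])) return 2; // same rack
        if (a[1].equals(b[1])) return 4;                      // same data center
        return 6;                                             // different data center
    }

    // Sorts the replica locations so the reader tries the nearest one first
    // (the list must be mutable).
    static List<String> sortByDistance(String readerPath, List<String> replicaPaths) {
        replicaPaths.sort(Comparator.comparingInt(p -> distance(readerPath, p)));
        return replicaPaths;
    }
}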

Checksum computation in WebHdfs

When a file is ingested using "hdfs dfs -put", the client computes the checksum and sends both the input data and the checksum to the Datanode for storage.
How does this checksum calculation/validation happen when a file is read/written using WebHDFS? How is data integrity ensured with WebHDFS?
The Apache Hadoop documentation doesn't mention anything about it.
WebHDFS is just a proxy through the usual datanode operations. Datanodes host the webhdfs servlets, which open standard DFSClients and read or write data through the standard pipeline. It's an extra step in the normal process but does not fundamentally change it. Here is a brief overview.
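For what it's worth, from the Java API a client can go through WebHDFS simply by using a webhdfs:// URI with the ordinary FileSystem interface; the host, port, and path below are made up. Because the webhdfs servlet on the datanode reads through a regular DFSClient, the usual block checksum verification still happens before data is returned over HTTP.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WebHdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // webhdfs:// talks to the NameNode's HTTP port; host and port are made up.
        try (FileSystem fs = FileSystem.get(
                 new URI("webhdfs://namenode.example.com:9870"), conf);
             FSDataInputStream in = fs.open(new Path("/tmp/example.txt"))) {
            // The datanode-side webhdfs servlet reads via a standard DFSClient,
            // so checksum verification is done before the bytes reach us here.
            byte[] buf = new byte[1024];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, Math.max(n, 0), StandardCharsets.UTF_8));
        }
    }
}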

Checksum process in HDFS /Hadoop

I could not understand how checksums work in HDFS to identify corrupt blocks while writing/reading files. Can someone explain it to me in detail?
Have a look at the Apache documentation regarding the HDFS Architecture.
Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software.
It works in the following way.
The HDFS client software implements checksum checker. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
If the checksum of another DataNode's block matches the checksum in the hidden file, the system will serve those data blocks.
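For reference, this per-chunk verification happens automatically inside the client library on every read; the FileSystem API additionally exposes an end-to-end file checksum that a client can request explicitly. A minimal sketch follows (the path is made up).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumLookupSketch {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/tmp/example.txt"); // made-up path
            // Per-chunk CRC verification is done transparently by the DFSClient
            // on read; getFileChecksum() additionally returns a whole-file
            // checksum that can be compared across copies of the file.
            FileChecksum checksum = fs.getFileChecksum(file);
            if (checksum != null) {
                System.out.println(checksum.getAlgorithmName() + " : " + checksum);
            }
        }
    }
}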
Have a look at the Robustness section too. The solution would be incomplete without a look at the data replication mechanism.
Each DataNode sends a Heartbeat message to the NameNode periodically.
A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message.
The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more.
DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
Example of a replication scenario:
It depends on the configuration of the cluster:
dfs.replication (assume that it is 3)
dfs.namenode.replication.min (assume that it is 1)
In case one DataNode is lost, the NameNode will recognize that a block is under-replicated. Then the NameNode will replicate the data blocks until dfs.replication is met.
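As a concrete aside, besides the dfs.replication setting in hdfs-site.xml, the replication factor of an existing file can also be changed through the Java API (the path below is made up), which triggers the same under-replication handling by the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Raise the replication factor of an existing file to 3; the NameNode
            // notices the blocks are now under-replicated and schedules new copies,
            // just as it does when a DataNode holding a replica dies.
            fs.setReplication(new Path("/tmp/example.txt"), (short) 3);
        }
    }
}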