Checksum computation in WebHdfs - hdfs

When a file is ingested using "hdfs dfs -put", the client computes a checksum and sends both the input data and the checksum to the Datanode for storage.
How does this checksum calculation/validation happen when a file is read or written using WebHdfs? How is data integrity ensured with WebHdfs?
The Hadoop documentation on Apache doesn't mention anything about it.

WebHDFS is essentially a proxy in front of the usual datanode operations. Datanodes host the webhdfs servlets, which open standard DFSClients and read or write data through the standard pipeline. It is an extra step in the normal process but does not fundamentally change it. Here is a brief overview.
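As a rough illustration of that proxying, here is a minimal sketch of reading a file through WebHDFS with the standard Hadoop FileSystem API (the namenode host, port and path are placeholders, not values from the question). Because the datanode-side servlet drives a regular DFSClient, block checksums are still verified on the server side during the read.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WebHdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The webhdfs:// scheme resolves to WebHdfsFileSystem; requests go over HTTP
        FileSystem fs = FileSystem.get(URI.create("webhdfs://namenode-host:50070"), conf);
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            // Copy the file to stdout; a corrupt block would surface as a read error
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}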

Related

flink readCSV thrown back with "org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException:Timeout waiting for connection from pool"

We are using the Flink 1.9.0 DataSet API to read CSV files from an Amazon S3 bucket and are facing connection pool timeouts most of the time.
The following are the configurations at the Flink level.
We read 19708 objects from S3 in a single go, as we need to apply the logic on top of the whole data set. Let's take an example: imagine 20 source folders, e.g. (AAA, BBB, CCC), with multiple subfolders (AAA/4May2020/../../1.csv, AAA/4May2020/../../2.csv, AAA/3May2020/../../1.csv, AAA/3May2020/../../2.csv, ...). Before calling readCSV, the logic scans the folders, picks only the one with the latest date folder, and passes that for reading. For the read operation we use a parallelism of "5", but when the execution graph is formed, all 20 sources come together.
We are running on Kube-AWS with around 10 Task Managers hosted on "m5.4xlarge" machines. Each Task Manager docker container is allocated "8" cores and "50GB" of memory.
The following were tried to address the issue, but no luck so far. We really need some pointers and help to address this:
Enabled the Flink retry mechanism with failover set to "region"; sometimes it gets through with retries, but even with retries it fails intermittently.
Revisited core-site.xml as per the AWS site: fs.s3a.threads.max: 3000, fs.s3a.connection.maximum: 4500 (see the sketch below).
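Purely for reference, the same two pool-related settings expressed with Hadoop's Configuration API; this is only an illustration of the keys and values quoted above, and how Flink's S3 filesystem actually picks up fs.s3a.* keys depends on your particular setup.

import org.apache.hadoop.conf.Configuration;

public class S3aPoolSettingsSketch {
    public static void main(String[] args) {
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("fs.s3a.threads.max", "3000");         // value from the question
        hadoopConf.set("fs.s3a.connection.maximum", "4500");  // value from the question
    }
}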
Also, could anyone help with the following questions:
Is there any way to check whether the HTTP connections opened by readCSV are closed?
Any pointers to understand how the DataSet readCSV operates would help.
Is there any way to introduce a wait mechanism before the read?
Is there a better way to address this issue?

Could an HDFS read/write process be suspended/resumed?

I have one question regarding the HDFS read/write process:
Assuming that we have a client (for the sake of the example let's say that the client is a Hadoop map process) that requests to read a file from HDFS and/or to write a file to HDFS, which process actually does the read/write from/to HDFS?
I know that there is a process for the Namenode and a process for each Datanode, and I know what their responsibilities to the system are in general, but I am confused in this scenario.
Is it the client's process by itself, or is there another process in HDFS, created and dedicated to this specific client, in order to access and read/write from/to HDFS?
Finally, if the second answer is true, is there any possibility that this process can be suspended for a while?
I have done some research, and the most important solutions that I found were Oozie and the JobControl class from the Hadoop API.
But because I am not sure about the above workflow, I am not sure which process I would be suspending and resuming with these tools.
Is it the client's process, or a process which runs in HDFS in order to serve the client's request?
Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, the above posts also explain datanode failure scenarios.
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
A single datanode failure triggers corrective actions by the framework.
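To make the division of labour concrete, here is a minimal sketch of a client-side HDFS write (the path and content are placeholders). The pipeline setup and the failure handling described above happen inside the client's DFSOutputStream, not in application code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The stream returned by create() builds the datanode pipeline
        // (client -> DN1 -> DN2 -> DN3) and reacts to datanode failures.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
    }
}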
Regarding your second query:
You have two types of schedulers:
FairScheduler
CapacityScheduler
Have a look at this article on suspend and resume
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs.
So, as far as I understand, the process of a Datanode receives the data from the client's process (which requests to store some data in HDFS) and stores it. Then this Datanode forwards the exact same data to another Datanode (to achieve replication), and so on. When the replication finishes, an acknowledgement goes back to the Namenode, which finally informs the client about the completion of its write request.
Based on the above flow, it is impossible to suspend an HDFS write operation in order to serve a second client's write request (let's assume that the second client has higher priority), because if we suspend the Datanode itself, it will remain suspended for everyone who wants to write to it, and as a result this part of HDFS will remain blocked. Finally, if I suspend a job via the JobControl class functions, I actually suspend the client's process (if I actually manage to catch it before its request is done). Please correct me if I am wrong.

Hadoop data structure to save block report in namenode

Hadoop saves files in the form of blocks on data nodes (DN), and metadata about the files is saved in the namenode (NN). Whenever a file is read by a client, the NN sends a read pipeline (a list of DNs) from which the file blocks are to be picked up. The read pipeline consists of the DNs nearest to the client that can serve the read request.
I am curious to know how the NN keeps information about the DNs for the blocks of a file, i.e. the data structure. Is it a graph with the locations of all replicas across the DNs? And is some algorithm then used, while creating a read pipeline, to find the shortest path between the client and the corresponding DNs?
There is not really a term named "read pipeline". For read/write requests, the granularity is the block level. Generally the write pipeline looks like: client -> DN1 -> DN2 -> DN3. For a read, the client contacts the NN for the list of DNs that hold replicas of the block to be read. Then the client reads the data from the nearest DN directly, without involving other DNs (in case of an error, the client may try another DN in the list).
As to the "data structure" of DN for block information, there is a block -> DNs in-memory mapping maintained by NN. Basically the mapping is a map. To update the map, DNs will periodically report its local replica of blocks to NN.
The client is free to choose the nearest DN for a read. For this, HDFS has to be topology-aware. From the HDFS architecture doc:
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
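If you want to see, from the client side, which DNs hold the replicas of each block of a file, the FileSystem API exposes that information; a small sketch follows (the path is a placeholder).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block, listing the datanodes holding a replica
        BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation loc : locations) {
            System.out.println(loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
        }
    }
}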

Checksum process in HDFS / Hadoop

I could not understand how checksums work in HDFS to identify corrupt blocks during file writes and reads. Can someone explain it to me in detail?
Have a look at the Apache documentation regarding HDFS Architecture.
Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software.
It works in the following way.
The HDFS client software implements a checksum checker. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
If the checksum of the block from another DataNode matches the checksum stored in the hidden file, the system will serve those data blocks.
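On the client side this verification is transparent. For reference, a small sketch of the related FileSystem calls (the path is a placeholder); a mismatch while reading normally surfaces as a ChecksumException, and the client falls back to another replica as described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");

        // Whole-file checksum derived from the per-block CRCs,
        // useful e.g. for comparing two files
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println(checksum);

        // Verification can be switched off explicitly, e.g. to salvage a corrupt file
        fs.setVerifyChecksum(false);
    }
}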
Have a look at the Robustness section too. The solution would be incomplete without a look at the data replication mechanism.
Each DataNode sends a Heartbeat message to the NameNode periodically.
A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message.
The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more.
DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
Example of a replication scenario:
It depends on the configuration of the cluster:
dfs.replication (assume that it is 3)
dfs.namenode.replication.min (assume that it is 1)
If one DataNode is lost, the NameNode will recognize that a block is under-replicated. Then the NameNode will replicate the data blocks until dfs.replication is met.
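A small sketch of those settings, plus raising the replication factor of an existing file (the path and values are illustrative; dfs.namenode.replication.min is normally a NameNode-side setting, shown here only to name the key).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);               // default target replica count
        conf.setInt("dfs.namenode.replication.min", 1);  // minimum replicas for a block

        FileSystem fs = FileSystem.get(conf);
        // Increasing the replication factor of an existing file makes the NameNode
        // schedule extra copies, just as it re-replicates blocks when a DataNode is lost.
        fs.setReplication(new Path("/data/sample.txt"), (short) 5);
    }
}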

Kafka message codec - compress and decompress

When using Kafka, I can set a codec by setting the kafka.compression.codec property of my Kafka producer.
Suppose I use snappy compression in my producer. When consuming the messages from Kafka using some Kafka consumer, should I do something to decode the data from snappy, or is it some built-in feature of the Kafka consumer?
In the relevant documentation I could not find any property that relates to encoding in the Kafka consumer (it only relates to the producer).
Can someone clarify this?
As far as my understanding goes, the decompression is taken care of by the consumer itself. As mentioned on their official wiki page:
The consumer iterator transparently decompresses compressed data and only returns an uncompressed message
As found in this article, the way the consumer works is as follows:
The consumer has background “fetcher” threads that continuously fetch data in batches of 1MB from the brokers and add it to an internal blocking queue. The consumer thread dequeues data from this blocking queue, decompresses and iterates through the messages
Also, on the doc page under End-to-end Batch Compression, it is written that:
A batch of messages can be clumped together compressed and sent to the server in this form. This batch of messages will be written in compressed form and will remain compressed in the log and will only be decompressed by the consumer.
So it appears that the decompression part is handled by the consumer itself; all you need to do is provide a valid / supported compression type using the compression.codec ProducerConfig attribute while creating the producer. I couldn't find any example or explanation suggesting any approach for decompression on the consumer end. Please correct me if I am wrong.
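For reference, a sketch of the producer-side setting this refers to, using the old 0.8-era producer API (the broker address, topic, message and serializer are placeholders); note that nothing codec-related is set on the consumer.

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class CompressedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("compression.codec", "snappy");   // or "gzip"; "none" disables compression

        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
        producer.send(new KeyedMessage<>("my-topic", "hello compressed world"));
        producer.close();
    }
}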
I have the same issue with v0.8.1, and this compression/decompression in Kafka is poorly documented, other than saying the consumer should "transparently" decompress compressed data, which it NEVER did.
The example high-level consumer client using ConsumerIterator on the Kafka web site only works with uncompressed data. Once I enable compression in the producer client, the message never gets into the following "while" loop. Hopefully they fix this issue ASAP, or they shouldn't claim this feature, as some users may use Kafka to transport large messages that need batching and compression capabilities.
ConsumerIterator<byte[], byte[]> it = stream.iterator();
while (it.hasNext())
{
    // message() is supposed to return the already-decompressed payload bytes
    String message = new String(it.next().message());
}
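For context, here is a sketch of how the stream above is typically obtained with the 0.8 high-level consumer (the ZooKeeper address, group id and topic are placeholders).

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class HighLevelConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk-host:2181");
        props.put("group.id", "test-group");

        ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        // Ask for one stream for the topic; decompression is supposed to happen
        // transparently inside the returned KafkaStream's iterator.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                consumer.createMessageStreams(Collections.singletonMap("my-topic", 1));
        KafkaStream<byte[], byte[]> stream = streams.get("my-topic").get(0);
        System.out.println("got stream: " + stream);
    }
}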