Hadoop stores files as blocks on DataNodes (DN), while metadata about files is kept on the NameNode (NN). Whenever a client reads a file, the NN sends a read pipeline (a list of DNs) from which the file's blocks are to be picked up. The read pipeline consists of the DNs nearest to the client that can serve the read request.
I am curious how the NN keeps information about which DNs hold the blocks of a file, i.e. what data structure it uses. Is it a graph holding the locations of all replicas, and is some algorithm later used, while building the read pipeline, to find the shortest path between the client and the corresponding DNs?
There is not really such a thing as a "read pipeline". For read/write requests, the granularity is the block level. The write pipeline generally looks like: client -> DN1 -> DN2 -> DN3. For a read, the client asks the NN for the list of DNs that hold replicas of the block to read, then reads the data from the nearest DN directly, without involving the other DNs (in case of an error, the client may try another DN in the list).
As to the "data structure" of DN for block information, there is a block -> DNs in-memory mapping maintained by NN. Basically the mapping is a map. To update the map, DNs will periodically report its local replica of blocks to NN.
The client is free to choose the nearest DN for the read. For this, HDFS needs to be topology-aware. From the HDFS architecture doc:
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
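A minimal sketch of that idea (the distance values and helper below are assumptions, not the actual HDFS NetworkTopology code): sort the candidate DNs by their topology distance from the reader and pick the closest.

```java
import java.util.*;

// Toy illustration of "closest replica first" ordering.
// Real HDFS computes distances from a rack/datacenter topology tree;
// here the distances are supplied directly for simplicity.
public class ReplicaChooser {
    public static List<String> orderByDistance(List<String> replicas,
                                               Map<String, Integer> distanceFromReader) {
        List<String> ordered = new ArrayList<>(replicas);
        // 0 = same node, 2 = same rack, 4 = same data center, 6 = remote (typical convention)
        ordered.sort(Comparator.comparingInt(dn -> distanceFromReader.getOrDefault(dn, Integer.MAX_VALUE)));
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Integer> dist = Map.of("dn-local-rack", 2, "dn-remote-dc", 6, "dn-same-dc", 4);
        System.out.println(orderByDistance(List.of("dn-remote-dc", "dn-same-dc", "dn-local-rack"), dist));
        // -> [dn-local-rack, dn-same-dc, dn-remote-dc]
    }
}
```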
We are using the Flink 1.9.0 DataSet API to read CSV files from an Amazon S3 bucket and are facing connection pool timeouts most of the time.
The following is our configuration and setup at the Flink level.
We read 19708 objects from S3 in a single go, as we need to apply our logic on top of the whole data set. As an example, imagine 20 source folders, e.g. (AAA, BBB, CCC), each with multiple subfolders (AAA/4May2020/../../1.csv, AAA/4May2020/../../2.csv, AAA/3May2020/../../1.csv, AAA/3May2020/../../2.csv, ...). Before calling readCSV, the logic scans the folders, picks only the one with the latest date, and passes that for the read. For the read operation we use a parallelism of 5, but when the execution graph is formed all 20 sources come up together.
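For reference, a simplified sketch of how the read is wired (the bucket name, paths, and record type here are placeholders, not our real ones):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class S3CsvReadSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One source per selected "latest date" folder; the path is a placeholder.
        DataSet<Tuple2<String, String>> aaa = env
                .readCsvFile("s3a://my-bucket/AAA/4May2020/")
                .types(String.class, String.class)
                .setParallelism(5);

        // ... the other ~19 sources are created the same way and then combined ...

        aaa.first(10).print();
    }
}
```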
We run on Kubernetes on AWS with around 10 Task Managers hosted on m5.4xlarge machines. Each Task Manager container is allocated 8 cores and 50 GB of memory.
The following were tried to address the issue, but with no luck so far. We really need some pointers and help to address this.
Enabled the Flink retry mechanism with the failover strategy set to "region". Sometimes it gets through with retries, but even with retries it still fails intermittently.
Revisited core-site.xml as per the AWS site, setting fs.s3a.threads.max to 3000 and fs.s3a.connection.maximum to 4500.
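In core-site.xml those settings correspond to something like the following (the property names are the standard S3A settings from above; the surrounding file layout is just illustrative):

```xml
<configuration>
  <property>
    <name>fs.s3a.threads.max</name>
    <value>3000</value>
  </property>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>4500</value>
  </property>
</configuration>
```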
Also, could anyone help with the following questions?
Is there any way to check whether the HTTP connections opened by readCSV are closed?
Any pointers on how the DataSet readCSV operates internally would help.
Is there any way to introduce a wait mechanism before the read?
Is there a better way to address this issue?
I was reading the Google File System paper and am not sure how it deals with write (not atomic record append) failures at replicas. If the write returns success, I assume the system waits until the next heartbeat for the master to get the updated state of the world, detect the corrupted/stale chunk version, and delete the chunk. I guess the master could also verify the validity of all the replicas whenever clients ask for replica locations, to prevent a client from ever getting stale/corrupted data. Is this how it deals with replica write failures?
The invalid chunks can be divided into two categories: stale chunks and chunks with corrupted checksums. They are detected in two different ways.
Stale chunk. The chunk's version is not up to date. The master checks for stale chunks during its regular heartbeat exchanges with the chunk servers.
Corrupted checksum. Since divergent replicas may be legal, replicas across GFS are not guaranteed to be identical. In addition, for performance reasons, checksum verification is performed independently on the chunk servers themselves rather than on the master.
Checksums are verified in two phases:
When clients or other chunk servers request the chunk
During idle periods, when the chunk servers scan and verify inactive chunks so that corrupted chunks are not counted as valid replicas
If a checksum mismatch is found, the chunk server reports the problem to the master. The master clones the replica from another chunk server that holds a healthy copy, and then instructs the reporting chunk server to delete its corrupted chunk.
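A toy sketch of that flow (the class and method names below are invented for illustration; they are not from the GFS codebase):

```java
import java.util.*;
import java.util.zip.CRC32;

// Illustration of a chunk server's idle-period checksum scan and the
// report-to-master step described above. All names here are hypothetical.
public class ChunkScanSketch {
    static long checksum(byte[] chunkData) {
        CRC32 crc = new CRC32();
        crc.update(chunkData);
        return crc.getValue();
    }

    // Scan inactive chunks and return the IDs whose stored checksum no longer matches.
    static List<Long> scanForCorruption(Map<Long, byte[]> chunkData,
                                        Map<Long, Long> storedChecksums) {
        List<Long> corrupted = new ArrayList<>();
        for (Map.Entry<Long, byte[]> e : chunkData.entrySet()) {
            long expected = storedChecksums.getOrDefault(e.getKey(), -1L);
            if (checksum(e.getValue()) != expected) {
                corrupted.add(e.getKey());   // would be reported to the master, which
            }                                // re-replicates and then orders deletion
        }
        return corrupted;
    }
}
```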
Back to your question: how does GFS deal with replica write failures?
If any error is encountered at any replica, the failure of the mutation is reported to the client. The client must handle the error and retry the mutation. The inconsistent replicas are garbage collected during the chunk servers' regular scans.
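A minimal sketch of that client-side behaviour, assuming a hypothetical client interface (not the actual GFS client library):

```java
// Hypothetical client-side retry loop for a failed mutation.
// GfsClient and MutationFailedException are invented names for illustration.
public class WriteRetrySketch {
    interface GfsClient {
        void write(long chunkHandle, long offset, byte[] data) throws MutationFailedException;
    }
    static class MutationFailedException extends Exception {}

    static void writeWithRetry(GfsClient client, long chunkHandle, long offset,
                               byte[] data, int maxAttempts) throws MutationFailedException {
        for (int attempt = 1; ; attempt++) {
            try {
                client.write(chunkHandle, offset, data);   // primary forwards to secondaries
                return;                                    // success on all replicas
            } catch (MutationFailedException e) {
                if (attempt >= maxAttempts) {
                    throw e;   // give up; inconsistent replicas get garbage collected later
                }
                // otherwise retry the mutation, as the paper requires the client to do
            }
        }
    }
}
```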
Your question is not very clear, so I'll answer based on an educated guess of what you are asking. It seems related to how the chunk version number helps in stale replica detection:
A client asks to write to a file.
The master sends a metadata response to the client; that response also includes the chunk version number. This accompanies the lease the master grants on one of the chunkservers (the primary).
The master also increments the chunk version number and asks all the up-to-date chunkservers (replicas) to record the new number.
All of this happens before the client starts writing to the chunk.
Now say a chunkserver crashes.
Once the crashed chunkserver restarts, the master and the chunkserver compare chunk version numbers during their heartbeat messages. If the write had gone through, the chunk version number on every replica would match the master's chunk version number; if it does not match, a failure occurred during the write.
The master therefore treats those replicas as stale, and during regular garbage collection all those failed replicas are removed.
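A small sketch of the version comparison idea (types and names are invented for illustration, not GFS code):

```java
import java.util.*;

// Illustration of stale-replica detection by comparing chunk version numbers
// reported by a chunkserver against the master's records. Names are hypothetical.
public class StaleReplicaCheckSketch {
    // chunk handle -> version number the master recorded when it last granted a lease
    private final Map<Long, Long> masterVersions = new HashMap<>();

    public StaleReplicaCheckSketch(Map<Long, Long> masterVersions) {
        this.masterVersions.putAll(masterVersions);
    }

    // Returns the chunk handles whose reported version lags behind the master's,
    // i.e. replicas that missed a mutation and should be garbage collected.
    public List<Long> findStaleReplicas(Map<Long, Long> reportedVersions) {
        List<Long> stale = new ArrayList<>();
        for (Map.Entry<Long, Long> e : reportedVersions.entrySet()) {
            long current = masterVersions.getOrDefault(e.getKey(), 0L);
            if (e.getValue() < current) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```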
Suppose I have the following two Actors:
Store
Product
Every Store can have multiple Products, and on high traffic I want to dynamically split a Store into StoreA and StoreB across multiple machines. Splitting the Store will also split its Products evenly between StoreA and StoreB.
My question is: what is the best practice for knowing where to send all future BuyProduct requests (StoreA or StoreB) after the split? I'm asking because, when a request to buy ProductA is received, I want to send it to the store that already has that Product's state in memory.
Solution: The only solution I can think of is to store the path of each Product, as a Map[productId: Long, storePath: String], in a ProductPathActor every time a new Product is created; for every BuyProduct request I would query the ProductPathActor, which returns the correct Store's path, and then send the BuyProduct request to that Store.
Is there another way of managing this in Akka, or is my solution correct?
One good way to do this is with Akka Cluster Sharding. From the docs:
Cluster sharding is useful when you need to distribute actors across several nodes in the cluster and want to be able to interact with them using their logical identifier, but without having to care about their physical location in the cluster, which might also change over time.
There is an Activator Template that demonstrates it here.
For your problem, StoreA and StoreB would each be a ShardRegion, mapping 1:1 to your cluster nodes. The ShardCoordinator manages distribution between these nodes and acts as the conduit between regions.
For its part, your Request Handler talks to a ShardRegion, which routes the message, in conjunction with the coordinator, if necessary. Presumably there is a JVM-local ShardRegion for each Request Handler to talk to, but there's no reason it could not be a remote actor.
When the number of nodes changes, the ShardCoordinator needs to move the shards (i.e. the collections of entities managed by a ShardRegion) away from nodes that are going to shut down, in a process called "rebalancing". During that period the entities within those shards are unavailable, but messages to them are buffered until they become available again. Here, "being available" means that the new ShardRegion responds to a message directed at that entity.
It's up to you to bring that entity back to life on the new node. Akka Persistence makes this very easy, but it requires you to use the Event Sourcing pattern in the process. That isn't a bad thing, as it can lead to web-scale performance much more easily, especially when the database in use is something like Apache Cassandra. You will see that entities are "passivated", which is essentially just caching their state off to disk so they can be restored on request, and Akka Persistence works with that passivation to transparently restore the entities under the control of the new ShardRegion – essentially a "move".
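To make the idea concrete, here is a rough sketch using the classic Cluster Sharding API in Java syntax (BuyProduct and ProductActor stand in for your own types, and the shard count of 100 is an arbitrary choice for the example):

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.cluster.sharding.ClusterSharding;
import akka.cluster.sharding.ClusterShardingSettings;
import akka.cluster.sharding.ShardRegion;

public class ProductShardingSketch {
    // Hypothetical command; productId doubles as the entity identifier.
    public static final class BuyProduct {
        public final long productId;
        public BuyProduct(long productId) { this.productId = productId; }
    }

    // Hypothetical entity actor holding one Product's state.
    public static class ProductActor extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(BuyProduct.class, msg -> { /* update in-memory product state */ })
                    .build();
        }
    }

    public static ActorRef startSharding(ActorSystem system) {
        ShardRegion.MessageExtractor extractor = new ShardRegion.MessageExtractor() {
            @Override public String entityId(Object message) {
                return message instanceof BuyProduct
                        ? String.valueOf(((BuyProduct) message).productId) : null;
            }
            @Override public Object entityMessage(Object message) { return message; }
            @Override public String shardId(Object message) {
                // arbitrary shard count of 100 for the sketch
                return message instanceof BuyProduct
                        ? String.valueOf(((BuyProduct) message).productId % 100) : null;
            }
        };

        return ClusterSharding.get(system).start(
                "Product",
                Props.create(ProductActor.class),
                ClusterShardingSettings.create(system),
                extractor);
    }
    // Usage: callers just send to the returned region ActorRef; the coordinator
    // routes to the node that currently owns the entity:
    //   productRegion.tell(new BuyProduct(42L), ActorRef.noSender());
}
```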
When a file is ingested using "hdfs dfs -put", the client computes a checksum and sends both the input data and the checksum to the DataNode for storage.
How does this checksum calculation/validation happen when a file is read or written using WebHDFS? How is data integrity ensured with WebHDFS?
The Apache Hadoop documentation doesn't mention anything about it.
WebHDFS is just a proxy in front of the usual DataNode operations. DataNodes host the WebHDFS servlets, which open standard DFSClients and read or write data through the standard pipeline. It's an extra step in the normal process, but it does not fundamentally change that process. Here is a brief overview.
I could not understand how checksums work in HDFS to identify corrupt blocks during file writes and reads. Can someone explain it to me in detail?
Have a look at the Apache documentation on HDFS Architecture.
Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software.
It works in the following way:
The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
If the checksum of the block from another DataNode matches the checksum stored in the hidden file, the system will serve that block.
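A simplified sketch of that read-side verification and fallback logic (the names and the plain CRC32 choice are illustrative; real HDFS manages per-chunk checksums internally):

```java
import java.util.*;
import java.util.zip.CRC32;

// Toy model of "verify the block against its stored checksum; if it fails,
// fall back to another DataNode holding a replica". Names are hypothetical.
public class ChecksumReadSketch {
    interface DataNodeClient {
        byte[] readBlock(String datanode, long blockId);
    }

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    static byte[] readVerified(DataNodeClient client, long blockId,
                               List<String> replicaNodes, long expectedChecksum) {
        for (String dn : replicaNodes) {
            byte[] data = client.readBlock(dn, blockId);
            if (crc(data) == expectedChecksum) {
                return data;              // checksum matches: serve this replica
            }
            // mismatch: this replica is corrupt, try the next DataNode in the list
        }
        throw new IllegalStateException("All replicas of block " + blockId + " are corrupt");
    }
}
```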
Have a look at the Robustness section too; the answer would be incomplete without looking at the data replication mechanism.
Each DataNode sends a Heartbeat message to the NameNode periodically.
A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message.
The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more.
DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
Example of a replication scenario:
It depends on the configuration of the cluster:
dfs.replication (assume it is 3)
dfs.namenode.replication.min (assume it is 1)
If one DataNode is lost, the NameNode will recognize that a block is under-replicated, and will then replicate the data blocks until dfs.replication is met.
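For reference, those two settings live in hdfs-site.xml and would look roughly like this (the values simply match the assumptions above):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.replication.min</name>
    <value>1</value>
  </property>
</configuration>
```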