In HDFS, why do nodes seem so unreliable?

In this article the author talks about data reliability: blocks are replicated among the DataNodes to ensure that data is preserved when a node crashes. I understand the concept, but what would make a node crash? Does this happen very often in practice?

HDFS runs on commodity hardware, i.e. nodes are built on cheap hardware to decrease the overall cost, so individual node failures (disk, power, network, or simply a machine being taken down for maintenance) are expected rather than exceptional.
Keeping this in mind, blocks are replicated across DataNodes so that no single failure loses data.
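As a hedged illustration (not part of the original answer), the per-file replication factor can be inspected and changed with the Hadoop Java client; the path and factor below are made-up examples, and the cluster-wide default comes from dfs.replication in hdfs-site.xml (3 by default):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/example.txt");  // hypothetical file

            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication factor: " + current);

            // Ask the NameNode to keep 3 copies of every block of this file.
            fs.setReplication(file, (short) 3);
        }
    }

It is this replication, combined with the NameNode re-replicating blocks when a DataNode stops heartbeating, that makes losing a cheap node a non-event.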

Related

When to Add Shards to a Distributed Key/Value Store

I've been reading up on distributed systems lately, and I've seen a lot of examples of how to shard key/value stores, like a memcache system or a NoSQL DB.
In general, adding shards makes intuitive sense to me when you want to support more concurrent access to the table, and most of the examples cover that sort of usage. One thing I'm not clear on though is whether you are also supposed to add shards as your total table size grows. For something like a memcache, I'd imagine this is necessary, because you need more nodes with more memory to hold more key/values. But what about databases which also keep the values on some sort of hard drive?
It seems like, if your table size is growing but the amount of concurrent access is not, it would be somewhat wasteful to keep adding nodes just to hold more data. In that case I'd think you could just add more long-term storage. But I suppose the problem is, you are increasing the chance that your data becomes "cold" when somebody needs it, causing more latency for those requests.
Is there a standard approach to scaling nodes vs. storage? Are they always linked? Thanks much for any advice.
I think it is the other way around.
Shards are almost always added because the data has grown to the point where it can no longer be held on one machine.
Sharding makes everything so much more painful that it should only be done when vertical scaling no longer works.
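To make the trade-off concrete, here is a minimal, hedged sketch of naive hash-based sharding (class and shard counts are made up); real stores typically use consistent hashing or range partitioning instead, precisely because of the problem noted after the code:

    import java.util.HashMap;
    import java.util.Map;

    /** Minimal sketch: route each key to one of N shards by hashing. */
    public class ShardRouter {
        private final int numShards;
        private final Map<Integer, Map<String, String>> shards = new HashMap<>();

        public ShardRouter(int numShards) {
            this.numShards = numShards;
            for (int i = 0; i < numShards; i++) {
                shards.put(i, new HashMap<>());   // stand-in for a real node
            }
        }

        private int shardFor(String key) {
            // floorMod avoids negative bucket indexes for negative hash codes.
            return Math.floorMod(key.hashCode(), numShards);
        }

        public void put(String key, String value) {
            shards.get(shardFor(key)).put(key, value);
        }

        public String get(String key) {
            return shards.get(shardFor(key)).get(key);
        }
    }

The pain shows up when numShards changes: with plain modulo hashing almost every key maps to a new shard, so growing data (not just concurrency) forces rebalancing, which is why you delay sharding until a single machine genuinely cannot hold the data.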

HBase: Difference between Minor and Major Compaction

I am having trouble understanding how major compaction differs from minor compaction. As far as I know, a minor compaction merges some HFiles into one or a few more HFiles.
And I think major compaction does almost the same thing, except that it also handles deleted rows.
So I have no idea why major compaction brings back data locality in HBase (when it is used over HDFS).
In other words, why can't minor compaction restore data locality, given that, to me, minor compaction and major compaction are both just merging HFiles into a smaller number of HFiles?
And why does only major compaction dramatically improve read performance? I think minor compaction also contributes to read performance.
Please help me to understand.
Thank you in advance.
Before understanding the difference between major and minor compactions, you need to understand the factors that impact performance from the point of view of compactions:
Number of files: Too many files negatively impact performance, due to file metadata and seek costs associated with each file.
Amount of data: The more data that has to be read, the lower the performance. This data could be useful or useless, the useless part mostly consisting of what HBase calls delete markers. These delete markers are used by HBase to mark a Cell/KeyValue that might be contained in an older HFile as deleted.
Data locality: Since HBase region servers are stateless processes and the data is actually stored in HDFS, the data that a region server serves could be on a different physical machine. How much of a region server's data is on the same machine counts towards data locality. While writing data, a region server tries to write the primary copy to the local HDFS DataNode, so a freshly written cluster has a data locality of 100% (or 1). But, due to region server restarts, region rebalancing, or region splitting, regions can move to a different machine than the one they originally started on, reducing locality. Higher locality means better IO performance, as HBase can then use something called short-circuit reads.
As you can imagine, the chances of having poor locality for older data are higher due to restarts and rebalances.
Now, an easy way to understand the difference between minor and major compactions is as follows:
Minor Compaction: This compaction type runs all the time and focusses mainly on newly written files. By virtue of being new, these files are small and can contain delete markers for data in older files. Since this compaction only looks at relatively newer files, it does not touch or delete data from older files. This means that until a different compaction type comes along and deletes the older data, this compaction type cannot remove the delete markers even from the newer files, otherwise the older deleted KeyValues would become visible again.
This leads to two outcomes:
As the files being touched are relatively new and small, the ability to affect data locality is very low. In fact, during a write operation, a region server tries to write the primary replica of the data to the local HDFS DataNode anyway, so a minor compaction usually does not add much to data locality.
Since the delete markers are not removed, some performance is still left on the table. That said, minor compactions are critical for HBase read performance as they keep the total file count under control which could be a big performance bottleneck especially on spinning disks if left unchecked.
Major Compaction: This type of compaction runs rarely (once a week by default) and focusses on complete cleanup of a store (one column family inside one region). The output of a major compaction is one file for one store. Since a major compaction rewrites all the data inside a store, it can remove both the delete markers and the older KeyValues marked as deleted by those delete markers.
This also leads to two outcomes:
Since delete markers and deleted data are physically removed, file sizes are reduced dramatically, especially in a system receiving a lot of delete operations. This can lead to a dramatic increase in performance in a delete-heavy environment.
Since all data of a store is being rewritten, it is a chance to restore data locality also for the older (and larger) files, where drift may have happened due to restarts and rebalances as explained earlier. This leads to better IO performance during reads.
More on HBase compactions: HBase Book
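For reference, a hedged sketch (not part of the original answer) of how minor and major compactions can be requested explicitly through the HBase Java Admin API; the table name is a made-up example and both calls only queue the compaction asynchronously:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CompactionDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                TableName table = TableName.valueOf("my_table");   // hypothetical table

                // Minor compaction request: merges a subset of the newer, smaller HFiles.
                admin.compact(table);

                // Major compaction request: rewrites every HFile of each store into one,
                // dropping delete markers and deleted cells, and restoring locality.
                admin.majorCompact(table);
            }
        }
    }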

Main causes for leader node to be at high CPU

I often see the leader node of our Redshift cluster peak at 100% CPU. I've identified one possible cause for this: many concurrent queries, and so many execution plans for the leader to calculate. This hypothesis seems very likely, as the times we have the most queries coming in seem to be the same as the times we see the leader at 100%.
To fix this at best, we are wondering: are there other main possible causes for a high CPU on the leader?
(To be precise, the situation is one where only the leader node is at high CPU and the workers seem fine.)
The Redshift leader node is the same size and class of compute as the compute nodes. Typically this means the leader is over-provisioned for the role it plays, but since its role is so important, and so impactful if it slows down, it is good that it is over-provisioned. The leader needs to compile and optimize the queries and perform final steps in queries (a final sort, for example). It communicates with the session clients and handles all their requests. If the leader becomes overloaded, all of these activities slow down, creating significant performance issues. It is not good that your leader is hitting 100% CPU often enough for you to notice. I bet the cluster seems sluggish when this happens.
There are a number of ways I've seen "leader abuse" and it usually becomes a problem when bad patterns are copied between users. In no particular order:
Large data literals in queries (INSERT ... VALUES ...). This puts your data through the query compiler on the leader node, which is not what it is designed to do and is very expensive for the leader. Use the COPY command to bring data into the cluster, as in the sketch at the end of this answer. (Just bad, don't do this.)
Overuse of COMMIT. Each commit causes an update to the coherent state of the database, needs to run through the "commit queue", and creates work for the leader and the compute nodes. Having a COMMIT after every other statement can cause this queue, and work in general, to back up.
Too many slots defined in the WLM. Redshift can typically only run somewhere between one and two dozen queries at once efficiently. Setting the total slot count very high (like 50) can lead to very inefficient operation and high CPU load. Depending on workload, this shows up on the compute nodes or, occasionally, the leader node.
Large data output through SELECT statements. SELECTs return data but when this data is many GBs in size the management of this data movement (and sorting) is done by the leader node. If large amounts of data need to be extracted from Redshift it should be done with an UNLOAD statement.
Overuse of large cursors. Cursors can be an important tool and needed for many BI tools but cursors are located on the leader and overuse can lead to reduced leader attention on other tasks.
Many / large UNLOADs with parallel off. UNLOADs generally come from the compute nodes straight to S3 but with "parallel off" all the data is routed to the leader node where it is combined (sorted) and sent to S3.
While none of the above are problems in and of themselves, the leader starts to be impacted when they are overused, used in ways they are not intended, or all happen at once. It also comes down to what you intend to do with your cluster - if it supports BI tools then you may have a lot of cursors, but this load on the leader is part of the cluster's intent. Issues often arise when the cluster is intended to be all things to everybody.
If your workload for Redshift is leader function heavy and you are efficiently using the leader node (no large literals, using COPY and UNLOAD, etc.) then high leader workload is what you want. You're getting the most out of the critical resource. However, most use Redshift to perform analytics on large data which is the function of the compute nodes. A highly loaded leader can detract significantly from this mission and needs to be addressed.
Another way the leader can get stressed is when clusters are configured with many smaller node types instead of fewer bigger nodes. Since the leader is the same size as the compute nodes, many smaller nodes means you have a small leader doing the work. Something to consider, but I'd make sure you don't have unneeded leader node stressors before investing in a resize.
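To make the COPY and UNLOAD points above concrete, here is a hedged JDBC sketch; the endpoint, table, S3 paths, and IAM role are made-up placeholders, not anything from the original question:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RedshiftLoadUnload {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:redshift://my-cluster.example.com:5439/dev";  // hypothetical endpoint
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement()) {

                // Bulk load through the compute nodes instead of pushing large
                // INSERT ... VALUES literals through the leader's query compiler.
                stmt.execute(
                    "COPY my_table FROM 's3://my-bucket/input/' " +
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' FORMAT AS CSV");

                // Export large result sets in parallel from the compute nodes to S3
                // rather than funnelling many GB through the leader with a SELECT.
                stmt.execute(
                    "UNLOAD ('SELECT * FROM my_table') TO 's3://my-bucket/output/' " +
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' PARALLEL ON");
            }
        }
    }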
Whenever you execute commands that require work on the leader node - dispatching data, computing statistics, or aggregating results from the workers, as with COPY, UNLOAD, VACUUM, or ANALYZE - you'll see an increase in its CPU usage. More information about this here: https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html

How do clients of a distributed blockchain know about consensus?

I have a basic blockchain I wrote to explore and learn more about the technology. The only real world experience I have with them is in a one-to-one transaction from client to server, as a record of transactions. I'm interested in distributed blockchains now.
In its simplest, most theoretical form, how is consensus managed? How do peers know to begin writing transactions on the next block? You have to know when >50% of the entire pool has accepted some last block written. But p2p systems can be essentially unbounded, and you can't trust a third party to handle surety, so how is this accomplished?
edit: I now know roughly how bitcoin handles consensus:
The consensus determines the accepted blockchain. The typical rule of "longest valid chain first" ensures that only one variant is accepted. People may accept a blockchain after any number of confirmations; typically 6 is sufficient to ensure a clear winner.
However, this seems like a slow and least-deliberate method. It ensures that there is a certain amount of wasted work on the part of nodes that happen to be in a part of the network that had a local valid solution at roughly the same time as a generally accepted solution.
Are there better alternatives?
Interesting question. I would say that blockchain technology achieves only probabilistic consensus: with a certain confidence, the blockchain network agrees on something.
Viewing blockchain as a distributed system, we can say that the state of the blockchain is distributed: the blockchain is kept as a whole, but there are many distributed local copies of it. More interestingly, the operations are distributed: writes and reads can happen at different nodes concurrently. Read operations can be done locally on the local copy of the blockchain, but such a read can of course be stale if your local copy is not up to date. However, there is always an incentive for nodes in the blockchain network to keep their local copy up to date so that they can complete new transactions when necessary.
Write operations are the tricky part that a blockchain must solve. As writes happen concurrently in a distributed fashion, the blockchain must avoid inconsistencies such as double spending and somehow reach consensus on the current state. The way blockchain does this is probabilistic. First of all, it is made expensive to write to the chain by adding the "puzzle" to be solved, which reduces the probability that different distributed writes happen concurrently; they can still happen, but with lower probability. In addition, as there is an incentive for nodes in the network to keep their state up to date, nodes that receive the flooded write operation will validate it and accept it into their chain. I think the incentive to always keep the chain up to date is key here, because it ensures that the chain makes progress: a writer has a clear incentive to keep its chain up to date since it is competing, under the "longest-chain-first" principle, against other concurrent writers. For non-adversarial miners there is also an incentive to interrupt the current mining, accept a new write-block, and restart the mining process, ensuring a sort of liveness in the system.
So blockchain relies on probabilistic consensus; what is the probability then? The probability of two exactly equal branches growing in parallel for long is close to 0, assuming there is no large group of adversarial nodes taking over the network. With very high probability one branch will become longer than the other and be accepted, the network reaches consensus on that branch, and write operations in the shorter branch have to be retried. The big concern is, of course, big adversarial miner groups who might deliberately try to create forks in the blockchain to perform double-spending attacks, but that is only likely to succeed if they get close to 50% of the computational power in the network.
So to conclude: natural branching, which can happen due to concurrent writes (with the probability reduced by the puzzle-solving), will with almost 100% probability converge to a single branch as write operations continue to happen, and the network reaches consensus on that single branch.
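As an illustration of the "longest valid chain first" rule discussed above, here is a minimal, hedged sketch (not any real client's implementation) of how a node might choose between competing branches:

    import java.util.List;

    /** Minimal fork-choice sketch: prefer the longest branch that validates. */
    public class ForkChoice {

        /** Stand-in for a real block; only the fields this sketch needs. */
        record Block(String hash, String previousHash) {}

        /** Placeholder validation: a real node also checks proof-of-work, signatures, etc. */
        static boolean isValidChain(List<Block> chain) {
            for (int i = 1; i < chain.size(); i++) {
                if (!chain.get(i).previousHash().equals(chain.get(i - 1).hash())) {
                    return false;   // broken linkage invalidates the branch
                }
            }
            return true;
        }

        /** Adopt the candidate branch only if it is valid and strictly longer. */
        static List<Block> chooseChain(List<Block> current, List<Block> candidate) {
            if (isValidChain(candidate) && candidate.size() > current.size()) {
                return candidate;   // blocks mined on the abandoned branch are "wasted" work
            }
            return current;
        }
    }

(Bitcoin actually compares accumulated proof-of-work rather than raw block count, but the length comparison captures the idea.)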
However, this seems like a slow and least-deliberate method. It ensures that there is a certain amount of wasted work on the part of nodes that happen to be in a part of the network that had a local valid solution at roughly the same time as a generally accepted solution.
Are there better alternatives?
Not that I can think of. There would be many more efficient solutions if all peers in the system "were under control": you could make them follow some protocol and perhaps have a designated leader decide the order of writes and ensure consensus, but that is not possible in a decentralized, open system.
In a permissioned blockchain environment, where the participants are known in advance, a client can get a cryptographic proof of the consensus (e.g. that a block was signed by at least 2/3 of the participants) and verify it. Usually this is achieved using threshold signatures.
In public blockchains, AFAIK, there is no way to do this, since the number of participants is unknown and changes all the time.
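For illustration, a hedged sketch of the kind of 2/3-quorum check a client could perform in such a permissioned setting, using one signature per known validator rather than a true threshold signature (key handling and names are made up):

    import java.security.PublicKey;
    import java.security.Signature;
    import java.util.Map;

    /** Sketch: accept a block only if at least 2/3 of the known validators signed it. */
    public class QuorumCheck {

        static boolean verifyOne(PublicKey key, byte[] blockHash, byte[] sig) {
            try {
                Signature verifier = Signature.getInstance("SHA256withECDSA");
                verifier.initVerify(key);
                verifier.update(blockHash);
                return verifier.verify(sig);
            } catch (Exception e) {
                return false;   // treat any failure as an invalid signature
            }
        }

        /** validators: every known participant; signatures: validator id -> signature bytes. */
        static boolean hasQuorum(Map<String, PublicKey> validators,
                                 Map<String, byte[]> signatures,
                                 byte[] blockHash) {
            long valid = signatures.entrySet().stream()
                    .filter(e -> validators.containsKey(e.getKey()))
                    .filter(e -> verifyOne(validators.get(e.getKey()), blockHash, e.getValue()))
                    .count();
            // At least 2/3 of all known validators must have produced a valid signature.
            return valid * 3 >= validators.size() * 2L;
        }
    }

A real threshold-signature scheme compresses this into a single signature the client can verify in one step, but the quorum rule being proven is the same.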

Which key-value / NoSQL database can ensure no data loss in case of a power failure?

At present, we are using Redis as an in-memory, fast cache. It is working well. The problem is, once Redis is restarted, we need to re-populate it by fetching data from our persistent store. This overloads our persistent store beyond its capacity and hence the recovery takes a long time.
We looked at Redis persistence options. The best option (without compromising performance) is to use AOF with 'appendfsync everysec'. But with this option, we can lose the last second of data. That is not acceptable. Using AOF with 'appendfsync always' has a considerable performance penalty.
So we are evaluating single-node Aerospike. Does it guarantee no data loss in case of power failures? i.e. in response to a write operation, once Aerospike sends success to the client, the data should never be lost, even if I pull the power cable of the server machine. As I mentioned above, I believe Redis can give this guarantee with the 'appendfsync always' option, but we are not considering it because of its considerable performance penalty.
If Aerospike can do it, I would want to understand in detail how persistence works in Aerospike. Please share some resources explaining the same.
We are not looking for a distributed system as strong consistency is a must for us. The data should not be lost in node failures or split brain scenarios.
If not Aerospike, can you point me to another tool that can help achieve this?
This is not a database problem, it's a hardware and risk problem.
All databases (that have persistence) work the same way: some write the data directly to the physical disk, while others tell the operating system to write it. The only way to ensure that every write is safe is to wait until the disk confirms the data is written.
There is no way around this and, as you've seen, it greatly decreases throughput. This is why databases use a memory buffer and write batches of data from the buffer to disk in short intervals. However, this means that there's a small risk that a machine issue (power, disk failure, etc) happening after the data is written to the buffer but before it's written to the disk will cause data loss.
On a single server, you can buy protection through multiple power supplies, battery backup, and other safeguards, but this gets tricky and expensive very quickly. This is why distributed architectures are so common today, for both availability and redundancy. Distributed systems do not mean you lose consistency; rather, they can help ensure it by protecting your data.
The easiest way to solve your problem is to use a database that allows for replication so that every write goes to at least 2 different machines. This way, one machine losing power won't affect the write going to the other machine and your data is still safe.
You will still need to protect against a power outage at a higher level that can affect all the servers (like your entire data center losing power) but you can solve this by distributing across more boundaries. It all depends on what amount of risk is acceptable to you.
Between tweaking the disk-write intervals in your database and using a proper distributed architecture, you can get the consistency and performance requirements you need.
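To illustrate the durability/throughput trade-off described above, here is a hedged sketch of the difference between a buffered append and one that is forced to disk (the file name is a made-up example); it mirrors, at a very small scale, what 'appendfsync everysec' versus 'appendfsync always' means for Redis's AOF:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class DurableAppend {
        public static void main(String[] args) throws IOException {
            try (FileChannel log = FileChannel.open(Path.of("append-only.log"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {

                byte[] record = "SET key value\n".getBytes();

                // Buffered write: fast, but the bytes may sit in the OS page cache
                // and can be lost if power fails before the kernel flushes them.
                log.write(ByteBuffer.wrap(record));

                // Forced write: force(true) returns only after the device reports the
                // data (and metadata) persisted; safe, but much slower per write.
                log.write(ByteBuffer.wrap(record));
                log.force(true);
            }
        }
    }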
I work for Aerospike. You can choose to have your namespace stored in memory, on disk or in memory with disk persistence. In all of these scenarios we perform favourably in comparison to Redis in real world benchmarks.
Considering storage on disk: when a write happens, it hits a buffer before being flushed to disk. The ack does not go back to the client until that buffer has been successfully written to. It is plausible that if you yank the power cable before the buffer flushes, in a single-node cluster the write might have been acked to the client and subsequently lost.
The answer is to have more than one node in the cluster and a replication factor >= 2. The write then goes to the buffer on the master and on the replica, and has to succeed on both before being acked to the client as successful. If the power is pulled from one node, a copy still exists on the other node and no data is lost.
So, yes, it is possible to make Aerospike as resilient as it is reasonably possible to be at low cost with minimal latencies. The best thing to do is to download the community edition and see what you think. I suspect you will like it.
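A hedged sketch of what such a write looks like from the standard Aerospike Java client (host, namespace, set, and bin names are made-up; the replication factor itself is configured server-side per namespace, not in this code):

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.policy.CommitLevel;
    import com.aerospike.client.policy.WritePolicy;

    public class AerospikeWriteDemo {
        public static void main(String[] args) {
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);  // hypothetical seed node
            try {
                WritePolicy policy = new WritePolicy();
                // Ack only after the master and all replicas have accepted the write.
                policy.commitLevel = CommitLevel.COMMIT_ALL;

                Key key = new Key("test", "demo-set", "user-42");   // namespace, set, user key
                client.put(policy, key, new Bin("name", "Alice"));
            } finally {
                client.close();
            }
        }
    }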
I believe Aerospike would serve your purpose. You can configure it for hybrid storage at the namespace (i.e. DB) level in aerospike.conf, which is present at /etc/aerospike/aerospike.conf.
For details, please refer to the official documentation here: http://www.aerospike.com/docs/operations/configure/namespace/storage/
I believe you're going to be at the mercy of the latency of whatever the storage medium is, or of the network fabric in the case of a cluster, regardless of what DBMS technology you use, if you must have a guarantee that the data won't be lost. (N.B. Ben Bates' solution won't work if there is a possibility that the whole physical plant loses power, i.e. both nodes lose power. But I would think an inexpensive UPS would substantially, if not completely, mitigate that concern.) And those latencies are going to cause a dramatic insert/update/delete performance drop compared to a standalone in-memory database instance.
Another option to consider is to use NVDIMM storage for either the in-memory database or for the write-ahead transaction log used for recovery. It will have the absolute lowest latency (comparable to conventional DRAM). And, if your in-memory database fits in the available NVDIMM memory, you'll have the fastest recovery possible (no need to replay a transaction log) and performance comparable to the original IMDB, because you're back to a single write versus the 2+ writes needed for a write-ahead log and/or replication to another node in a cluster. But there are two requirements for this to be an option:
1. The entire database must fit in the NVDIMM memory
2. The database system has to be able to support recovery of the database directly after system restart, without a transaction log.
More in this white paper http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf