HBase: Difference between Minor and Major Compaction - hdfs

I am having trouble understanding how major compaction differs from minor compaction. As far as I know, minor compaction merges some HFiles into one or a few HFiles.
And I think major compaction does almost the same thing, except that it also handles deleted rows.
So I have no idea why major compaction restores the data locality of HBase (when it is used over HDFS).
In other words, why can't minor compaction restore data locality, given that to me both minor and major compaction are just merging HFiles into a smaller number of HFiles?
And why does only major compaction dramatically improve read performance? I think minor compaction also contributes to read performance.
Please help me to understand.
Thank you in advance.

Before understanding the difference between major and minor compactions, you need to understand the factors that impact performance from the point of view of compactions:
Number of files: Too many files negatively impact performance, due to file metadata and seek costs associated with each file.
Amount of data: More data means more to read, so too much data hurts performance. Now, this data could be useful or useless, i.e., consisting mostly of what HBase calls delete markers. These delete markers are used by HBase to mark a Cell/KeyValue that might be contained in an older HFile as deleted.
Data locality: Since HBase region servers are stateless processes, and the data is actually stored in HDFS, the data that a region server serves could be on a different physical machine. How much of a region server's data is on the same machine counts towards data locality. While writing data, a region server tries to write the primary copy of the data to the local HDFS data node, so freshly written data gives the cluster a data locality of 100%, or 1. But due to region server restarts, region rebalancing, or region splitting, regions can move to a different machine than the one they originally started on, thus reducing locality. Higher locality means better IO performance, as HBase can then use something called short-circuit reads.
As you can imagine, the chances of having poor locality for older data are higher due to restarts and rebalances.
Now, an easy way to understand the difference between minor and major compactions is as follows:
Minor Compaction: This compaction type runs all the time and focuses mainly on newly written files. By virtue of being new, these files are small and can contain delete markers for data in older files. Since this compaction only looks at relatively new files, it does not touch or delete data in older files. This means that until a different compaction type comes along and deletes the older data, this compaction type cannot remove the delete markers, even from the newer files; otherwise the older deleted KeyValues would become visible again.
This leads to two outcomes:
As the files being touched are relatively new and small, the potential to improve data locality is very low. In fact, during a write operation, a region server tries to write the primary replica of the data to the local HDFS data node anyway. So, a minor compaction usually does not add much value to data locality.
Since the delete markers are not removed, some performance is still left on the table. That said, minor compactions are critical for HBase read performance, as they keep the total file count under control; an unchecked file count can become a big performance bottleneck, especially on spinning disks.
Major Compaction: This type of compaction runs rarely (once a week by default) and focuses on the complete cleanup of a store (one column family inside one region). The output of a major compaction is one file per store. Since a major compaction rewrites all the data inside a store, it can remove both the delete markers and the older KeyValues marked as deleted by those delete markers.
This also leads to two outcomes:
Since delete markers and deleted data are physically removed, file sizes are reduced dramatically, especially in a system receiving a lot of delete operations. This can lead to a dramatic increase in performance in a delete-heavy environment.
Since all the data of a store is being rewritten, it is also a chance to restore data locality for the older (and larger) files, where drift may have happened due to the restarts and rebalances explained earlier. This leads to better IO performance during reads.
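To make the tombstone mechanics concrete, here is a minimal C++ sketch (illustrative only, not HBase's actual code; all names are made up): a merge that only sees a subset of files must keep delete markers, while a full rewrite of the store can drop both the markers and the cells they shadow.

    #include <map>
    #include <string>
    #include <vector>

    // One cell: a key, a sequence number (newer = larger), a tombstone flag.
    struct Cell {
        std::string key;
        long seq = 0;
        bool deleteMarker = false;
    };

    // Merge HFile-like sorted runs into one run, keeping only the newest
    // cell per key (a simplification of HBase's multi-version semantics).
    // fullRewrite == false: minor-style merge over a *subset* of files; the
    //   delete markers must survive, because the data they shadow may live
    //   in older files that are not part of this merge.
    // fullRewrite == true: major-style rewrite over *all* files of a store;
    //   markers and the cells they shadow can be physically dropped.
    std::vector<Cell> compact(const std::vector<std::vector<Cell>>& files,
                              bool fullRewrite) {
        std::map<std::string, Cell> newest;
        for (const auto& file : files)
            for (const auto& c : file) {
                auto it = newest.find(c.key);
                if (it == newest.end() || it->second.seq < c.seq)
                    newest[c.key] = c;
            }
        std::vector<Cell> out;
        for (const auto& kv : newest) {
            if (kv.second.deleteMarker && fullRewrite)
                continue;             // major compaction: tombstone dropped
            out.push_back(kv.second); // minor compaction: tombstone kept
        }
        return out;
    }

Only the full rewrite may forget the tombstone itself, because only then is it guaranteed that no older, unmerged HFile still holds a value for that key.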
More on HBase compactions: HBase Book

Related

Preloading data into RAM for fast transaction

My thinking is this: we preload clients' data (account number, net balance) in advance. Whenever a transaction is processed, the transaction record is written to RAM in a FIFO data structure, and the client's data in RAM is updated as well. After a certain period, the records are written to the on-disk database, to prevent data loss from volatile RAM.
By doing so, the time spent on I/O should be saved, and hence less time is needed to look up clients' data (faster transactions).
I have heard about in-memory databases, but I do not know whether my idea is the same thing. Also, is there a better idea than what I am thinking of?
In my opinion, there are several aspects to think about and research to get a step forward. Pre-loading and working on data in memory is usually faster than being bound to disk or database page access schemes. However, you instantly lose durability. Therefore, three approaches are valid in different situations:
disk-synchronous (good old database way, after each transaction data is guaranteed to be in permanent storage)
in-memory (good as long as the system is up and running, faster by orders of magnitude, risk of losing transaction data on errors)
delayed (basically in-memory, but from time to time data is flushed to disk)
It is worth noting that the delayed approach is directly supported on Linux through memory-mapped files, which are, on the one hand, often as fast as ordinary memory (unless you read and access too many pages) and, on the other hand, synced to disk automatically (but not instantly).
As you tagged C++, this is possibly the simplest way of getting your idea running.
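As a minimal sketch of that delayed approach on Linux (POSIX C++; the file name and record format are made up, and error handling is trimmed): writes to the mapping run at memory speed, and msync() flushes them to disk from time to time.

    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        const size_t size = 1 << 20;  // 1 MiB region for the record log
        int fd = open("txn.log", O_RDWR | O_CREAT, 0644);  // hypothetical file
        if (fd < 0 || ftruncate(fd, size) != 0) return 1;  // reserve space

        // Writes to 'base' hit the page cache at memory speed.
        char* base = static_cast<char*>(
            mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        if (base == MAP_FAILED) return 1;

        // Appending a transaction record is a plain memory write.
        const char rec[] = "txn=42 amount=100\n";
        std::memcpy(base, rec, sizeof rec - 1);

        // "From time to time": force dirty pages to disk. Until this call
        // returns, a crash can still lose the record, which is exactly the
        // durability trade-off of the delayed approach.
        msync(base, size, MS_SYNC);

        munmap(base, size);
        close(fd);
        return 0;
    }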
Note, however, that when you assume failures (hardware, reboot, etc.) you won't have transactions at all, because it is non-trivial to tell exactly when the data has actually been written.
As a side note: sometimes this problem is solved by writing (reliably) to a log file first (sequential access, therefore faster than writing directly to the data files). Search for the word Compaction in the context of databases: this is the operation that merges a log with the regularly used on-disk data structures, and it happens from time to time (when the log gets too large).
To the last aspect of the question: Yes, in-memory databases work in main memory. Still, depending on the guarantees (ACID?) they give, some operations still involve hard disk or NVRAM.

Which Key value, Nosql database can ensure no data loss in case of a power failure?

At present, we are using Redis as an in-memory, fast cache. It is working well. The problem is, once Redis is restarted, we need to re-populate it by fetching data from our persistent store. This overloads our persistent store beyond its capacity, and hence the recovery takes a long time.
We looked at Redis persistence options. The best option (without compromising performance) is to use AOF with 'appendfsync everysec'. But with this option, we can lose the last second of data. That is not acceptable. Using AOF with 'appendfsync always' has a considerable performance penalty.
So we are evaluating single-node Aerospike. Does it guarantee no data loss in case of power failures? i.e., in response to a write operation, once Aerospike sends success to the client, the data should never be lost, even if I pull the power cable of the server machine. As I mentioned above, I believe Redis can give this guarantee with the 'appendfsync always' option, but we are not considering it, as it has a considerable performance penalty.
If Aerospike can do it, I would want to understand in detail how persistence works in Aerospike. Please share some resources explaining the same.
We are not looking for a distributed system as strong consistency is a must for us. The data should not be lost in node failures or split brain scenarios.
If not aerospike, can you point me to another tool that can help achieve this?
This is not a database problem, it's a hardware and risk problem.
All databases (that have persistence) work the same way: some write the data directly to the physical disk, while others tell the operating system to write it. The only way to ensure that every write is safe is to wait until the disk confirms the data is written.
There is no way around this and, as you've seen, it greatly decreases throughput. This is why databases use a memory buffer and write batches of data from the buffer to disk in short intervals. However, this means that there's a small risk that a machine issue (power, disk failure, etc) happening after the data is written to the buffer but before it's written to the disk will cause data loss.
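A minimal POSIX/C++ sketch of the two ends of that trade-off (the file name is hypothetical): the only pattern that guarantees the data survives a power cut is to fsync() before acknowledging each write, which is precisely what makes per-write durability slow.

    #include <fcntl.h>
    #include <unistd.h>

    // Durable append: do not report success until the disk confirms.
    // (Equivalent in spirit to Redis's 'appendfsync always'; partial
    // writes are ignored here for brevity.)
    bool durable_append(int fd, const void* buf, size_t len) {
        if (write(fd, buf, len) != static_cast<ssize_t>(len)) return false;
        return fsync(fd) == 0;  // block until the data leaves the buffers
    }

    int main() {
        int fd = open("store.aof", O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0) return 1;
        const char rec[] = "SET key 42\n";
        // Ack the client only if this returns true. Dropping the fsync()
        // (or batching it on an interval) restores throughput but reopens
        // the small window of loss described above.
        bool ok = durable_append(fd, rec, sizeof rec - 1);
        close(fd);
        return ok ? 0 : 1;
    }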
On a single server, you can buy protection through multiple power supplies, battery backup, and other safeguards, but this gets tricky and expensive very quickly. This is why distributed architectures are so common today, for both availability and redundancy. Distributed systems do not mean you lose consistency; rather, they can help ensure it by protecting your data.
The easiest way to solve your problem is to use a database that allows for replication so that every write goes to at least 2 different machines. This way, one machine losing power won't affect the write going to the other machine and your data is still safe.
You will still need to protect against a power outage at a higher level that can affect all the servers (like your entire data center losing power) but you can solve this by distributing across more boundaries. It all depends on what amount of risk is acceptable to you.
Between tweaking the disk-write intervals in your database and using a proper distributed architecture, you can get the consistency and performance requirements you need.
I work for Aerospike. You can choose to have your namespace stored in memory, on disk or in memory with disk persistence. In all of these scenarios we perform favourably in comparison to Redis in real world benchmarks.
Considering storage on disk: when a write happens, it hits a buffer before being flushed to disk. The ack does not go back to the client until that buffer has been successfully written to. It is plausible that if you yank the power cable before the buffer flushes, in a single-node cluster the write might have been acked to the client and subsequently lost.
The answer is to have more than one node in the cluster and a replication factor >= 2. The write then goes to the buffer on both the master and the replica and has to succeed on both before being acked to the client as successful. If the power is pulled from one node, a copy would still exist on the other node and no data would be lost.
So, yes, it is possible to make Aerospike as resilient as it is reasonably possible to be at low cost with minimal latencies. The best thing to do is to download the community edition and see what you think. I suspect you will like it.
I believe Aerospike would serve your purpose. You can configure it for hybrid storage at the namespace (i.e. DB) level in aerospike.conf, which is present at /etc/aerospike/aerospike.conf.
For details, please refer to the official documentation here: http://www.aerospike.com/docs/operations/configure/namespace/storage/
I believe you're going to be at the mercy of the latency of whatever the storage medium is, or the latency of the network fabric in the case of a cluster, regardless of which DBMS technology you use, if you must have a guarantee that the data won't be lost. (N.B. Ben Bates' solution won't work if there is a possibility that the whole physical plant loses power, i.e. both nodes lose power. But I would think an inexpensive UPS would substantially, if not completely, mitigate that concern.) And those latencies are going to cause a dramatic insert/update/delete performance drop compared to a standalone in-memory database instance.
Another option to consider is to use NVDIMM storage, either for the in-memory database or for the write-ahead transaction log used for recovery. It will have the absolute lowest latency (comparable to conventional DRAM). And, if your in-memory database fits in the available NVDIMM memory, you'll have the fastest recovery possible (no need to replay from a transaction log) and performance comparable to the original in-memory database, because you're back to a single write versus the 2+ writes needed for a write-ahead log and/or replication to another node in a cluster. But your in-memory database system has to be able to support direct recovery of an in-memory database (not just recovery from a transaction log). Again, two requirements for this to be an option:
1. The entire database must fit in the NVDIMM memory
2. The database system has to be able to support recovery of the database directly after system restart, without a transaction log.
More in this white paper http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf

Predictively computing potentially needed values on large shared data structure with infrequent updates

I have a system I need to design with low latency in mind, processing power and memory are generous. I have a large (several GB) data structure that is updated once every few seconds. Many (read only) operations are going to run against this data structure between updates, in parallel, accessing it heavily. As soon as an update occurs, all computations in progress should be cleanly cancelled, as their results are invalidated by the update.
The issue I'm running into here is that writes are infrequent, and readers access the structure so often that locking around each individual reader access would be a huge hit to performance. I'm fine with the readers reading invalid data, but then I need to deal with any broken invariants (assertions) or segfaults due to stale pointers, etc. At the same time, I can't have readers block writers, so reader-writer locks acquired at every reader thread's start are unacceptable.
The only solution I can think of, and it has a number of issues, is to allocate a mapping with mmap, put the readers in separate processes, and mprotect the memory to kill the workers when it's time to update. I'd prefer a cross-platform solution (ideally pure C++), however, and ideally one that doesn't fork every few seconds. This would also require some surgery to get all the data structures located in shm.
Something like a revocable lock would do exactly what I need, but I don't know of any libraries that provide such functionality.
If this were a database, I'd use multi-version concurrency control. Readers obtain a logical snapshot while the underlying physical data structures are mostly lock-free (or locked very briefly and at a fine granularity).
You say your memory is generously equipped. Can you just create a complete copy of the data structure? Then you modify the copy and swap it out atomically.
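A minimal C++ sketch of that copy-then-swap idea (names are illustrative; C++11's atomic shared_ptr free functions are used here, while C++20 offers std::atomic<std::shared_ptr<...>> instead): readers pin the current snapshot, the writer publishes a fresh copy atomically, and the old snapshot dies when its last reader releases it.

    #include <atomic>
    #include <memory>
    #include <vector>

    struct Snapshot {              // stands in for the multi-GB structure
        std::vector<int> data;
    };

    std::shared_ptr<const Snapshot> g_current =
        std::make_shared<const Snapshot>();

    // Reader: pin the current version. It stays alive and immutable for
    // as long as this shared_ptr exists, even if the writer swaps below.
    long long reader_sum() {
        std::shared_ptr<const Snapshot> snap = std::atomic_load(&g_current);
        long long sum = 0;
        for (int v : snap->data) sum += v;
        return sum;
    }

    // Writer (runs every few seconds): copy, modify the private copy,
    // then publish it with a single atomic pointer swap.
    void writer_update(int newValue) {
        auto next = std::make_shared<Snapshot>(*std::atomic_load(&g_current));
        next->data.push_back(newValue);  // no reader can see this yet
        std::atomic_store(&g_current,
                          std::shared_ptr<const Snapshot>(std::move(next)));
        // The old snapshot is freed once the last reader drops it.
    }

In-progress readers simply finish on the old snapshot instead of being cancelled, which sidesteps the stale-pointer problem entirely, at the cost of briefly holding two copies in memory.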
Or, can you use immutable data structures, so that readers continue to use the old version while the writer creates new objects?
Or, you implement MVCC in a fine-grained way. Let's say you want to version a hash-set. Instead of keeping one value per key, you keep one value per key per version. Readers read from the latest version that is <= the version that existed when they started to read. Writers create a new version number for each write "transaction". Only when all writes are complete do readers start picking up changes from the new version. This is how MVCC databases do it.
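A small C++ sketch of that per-key versioning (illustrative only; the synchronization is deliberately omitted, and a real implementation needs a concurrent map or brief fine-grained locks around the std::map operations):

    #include <atomic>
    #include <iterator>
    #include <map>
    #include <string>

    // One value per key per version.
    std::map<std::string, std::map<long, int>> g_store; // key -> version -> value
    std::atomic<long> g_published{0};  // highest fully written version

    // Reader: pin g_published at start, then always read the newest value
    // whose version is <= that snapshot version.
    bool mvcc_get(const std::string& key, long snapshotVersion, int& out) {
        auto it = g_store.find(key);
        if (it == g_store.end()) return false;
        auto v = it->second.upper_bound(snapshotVersion); // first > snapshot
        if (v == it->second.begin()) return false;        // nothing visible
        out = std::prev(v)->second;
        return true;
    }

    // Writer "transaction": stage every change under an unpublished version
    // number, then publish it in one step so readers never see half a write.
    void mvcc_write(const std::map<std::string, int>& changes) {
        const long next = g_published.load() + 1;  // single writer assumed
        for (const auto& kv : changes)
            g_store[kv.first][next] = kv.second;   // invisible until published
        g_published.store(next);                   // readers may now pin it
    }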
Besides these approaches, I also liked your mmap idea. I don't think you need a separate process if your OS supports copy-on-write memory mappings. Then you can map the same memory area multiple times and provide a stable snapshot to readers.

What is cache in C++ programming? [closed]

Firstly, I would like to say that I come from a non-computer-science background and have been learning the C++ language.
I am unable to understand what exactly a cache is.
It has different meaning in different contexts.
I would like to know what would be called as a cache in a C++ program?
For example, if I have some int data in a file, and I read it and store it in an int array, would this mean that I have 'cached' the data?
To me it seems like common sense to use the data this way, since reading from a file is always slower than reading from RAM.
But I am a little confused due to this article.
In a CPU there can be several caches, to speed up instructions in loops or to store often accessed data. These caches are small but very fast. Reading data from cache memory is much faster than reading it from RAM.
It says that reading data from cache is much faster than from RAM.
I thought RAM & cache were the same.
Can somebody please clear my confusion?
EDIT: I am updating the question because previously it was too broad.
My confusion started with this answer. He says
RowData and m_data are specific to my implementation, but they are simply used to cache information about a row in the file
What does cache in this context mean?
Any modern CPU has several layers of cache that are typically named things like L1, L2, L3 or even L4. This is called a multi-level cache. The lower the number, the faster the cache will be.
It's important to remember that the CPU runs at speeds that are significantly faster than the memory subsystem. It takes the CPU a tiny eternity to wait for something to be fetched from system memory, many, many clock-cycles elapse from the time the request is made to when the data is fetched, sent over the system bus, and received by the CPU.
There's no programming construct for dealing with caches, but if your code and data can fit neatly in the L1 cache, then it will be fastest. Next is if it can fit in the L2, and so on. If your code or data cannot fit at all, then you'll be at the mercy of the system memory, which can be orders of magnitude slower.
This is why counter-intuitive things like unrolling loops, which should be faster, might end up being slower because your code becomes too large to fit in cache. It's also why shaving a few bytes off a data structure could pay huge dividends even though the memory footprint barely changes. If it fits neatly in the cache, it will be faster.
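A small C++ illustration of that "fits in the cache" effect (sizes and the stride are arbitrary): both functions do the same additions, but the strided one touches a new cache line on almost every access, so on typical hardware it benchmarks several times slower on a large array.

    #include <vector>

    // Sequential pass: consecutive addresses, so each 64-byte cache line
    // is fetched once and fully used.
    long long sum_sequential(const std::vector<int>& a) {
        long long s = 0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i];
        return s;
    }

    // Strided pass: the same number of additions, but the jumps defeat
    // the cache, causing far more memory traffic for identical results.
    long long sum_strided(const std::vector<int>& a, size_t stride) {
        if (stride == 0) return 0;  // guard against an infinite loop
        long long s = 0;
        for (size_t start = 0; start < stride; ++start)
            for (size_t i = start; i < a.size(); i += stride) s += a[i];
        return s;
    }

Timing both with std::chrono over an array much larger than the L3 cache is exactly the kind of careful benchmarking recommended next.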
The only way to know if you have a performance problem related to caching is to benchmark very carefully. Remember each processor type has varying amounts of cache, so what might work well on your i7 CPU might be relatively terrible on an i5.
It's only in extremely performance sensitive applications that the cache really becomes something you worry about. For example, if you need to maintain a steady 60FPS frame rate in a game, you'll be looking at cache problems constantly. Every millisecond counts here. Likewise, anything that runs the CPU at 100% for extended periods of time, such as rendering video, will want to pay very close attention to how much they could gain from adjusting the code that's emitted.
You do have some control over how your code is generated, via compiler flags. Some will produce smaller code, and some will produce theoretically faster code by unrolling loops and other tricks. Finding the optimal settings can be a very time-consuming process. Likewise, you'll need to pay very careful attention to your data structures and how they're used.
[Cache] has different meaning in different contexts.
Bingo. Here are some definitions:
Cache
Verb
Definition: To place data in some location from which it can be more efficiently or reliably retrieved than its current location. For instance:
Copying a file to a local hard drive from some remote computer
Copying data into main memory from a file on a local hard drive
Copying a value into a variable when it is stored in some kind of container type in your procedural or object oriented program.
Examples: "I'm going to cache the value in main memory", "You should just cache that, it's expensive to look up"
Noun 1
Definition: A copy of data that is presumably more immediately accessible than the source data.
Examples: "Please keep that in your cache, don't hit our servers so much"
Noun 2
Definition: A fast-access memory region on the die of a processor; modern CPUs generally have several levels of cache. See CPU cache; note that GPUs and other types of processors will also have their own caches, with different implementation details.
Examples: "Consider keeping that data in an array so that accessing it sequentially will be cache coherent"
My definition of a cache would be: something that is limited in size but faster to access, since there is less area to search. If you are talking about caching in a programming language, it means you are storing some information in the form of a variable (a variable is nothing but a way to locate your data in memory) in memory. Here, memory means both RAM and the physical cache (CPU cache).
The physical/CPU cache is nothing but memory that is even faster than RAM; it stores copies of some of the data in RAM that the CPU uses very often. There is another level of categorisation after that as well, namely on-board cache (faster) and off-board cache. You can see this link.
I am updating the question because previously it was too broad. My confusion started with this answer. He says:
RowData and m_data are specific to my implementation, but they are simply used to cache information about a row in the file
What does cache in this context mean?
This particular use means that RowData is held as a copy in memory, rather than reading (a little bit of) the row from a file every time we need some data from it. Reading from a file is a lot slower [1] than holding on to a copy of the data in our program's memory.
[1] Although in a modern OS, the actual data from the hard-disk is probably held in memory, in file-system cache, to avoid having to read the disk many times to get the same data over and over. However, this still means that the data needs to be copied from the file-system cache to the application using the data.
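A minimal sketch of that pattern (file name hypothetical): read the rows once, keep them in a vector, and serve every later access from the program's own memory rather than from the file.

    #include <fstream>
    #include <string>
    #include <vector>

    // Read every row once: an application-level cache of the file contents.
    std::vector<std::string> load_rows(const std::string& path) {
        std::ifstream in(path);              // e.g. "rows.txt"
        std::vector<std::string> rows;
        std::string line;
        while (std::getline(in, line)) rows.push_back(line);
        return rows;
    }

    // Later lookups hit the vector in RAM, not the disk (and not even the
    // OS file-system cache, which would still require a copy).
    const std::string& get_row(const std::vector<std::string>& rows, size_t i) {
        return rows.at(i);
    }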

Streaming, in-place binary patching

I have series of large binary files, each of which is produced by modifying the previous one. They are stored on a server (the server is just a dumb file store, we can't run programs on it).
To save space I want to store them as diffs. The problem comes when we download the files: they are so large that there is not enough disk space on the client to store both the original file and a diff.
Is there a diff algorithm which will allow us to download the original file to disk, and then apply a patch as it is streamed from the server, in place? AIUI, both xdelta and rdiff can't modify the original file, only create a new copy (which will take too much disk space).
The short answer is sadly no. Though...
The problem with in-place patching is the mix of insertions and references to old data. Insertions need existing data to be moved around to make enough room, i.e. copying the tail of the file backwards (and in the general case this will be pretty slow, taking time proportional to the size of the file itself in the worst case). References to old data would need to be handled extremely cautiously to limit that worst case...
With all the constraints needed to make this possible on the client while providing a real advantage in occupied space during patching, the patch would probably be far bigger than what xdelta or rdiff produce. The patching process would be far slower as well.
One possibility with an intelligent server would be to:
compose all the patches into one meta-patch
stream the original file, patched on the fly (thus doing the recomposition on the wire)
The idea for realizing a sync diff (see the sketch after this list):
Split the data into many blocks of a fixed size;
Calculate which blocks in the old data need to be retained (a sync algorithm can be used);
Move the retained blocks to their new locations, in the order they appear in the new data;
Download the missing blocks to fill the gaps.
Among these, the most difficult part is optimizing the speed of moving the blocks; you need to make full use of memory, or of the remaining disk space, in order to minimize disk I/O.
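A rough C++ sketch of steps 2-4 under strong simplifying assumptions (whole fixed-size blocks only; the plan is assumed to be ordered so that no retained block is overwritten before it is read, which the sorting step must guarantee; all names are made up):

    #include <algorithm>
    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    constexpr size_t kBlock = 4096;  // step 1: fixed block size

    // One entry per block of the *new* file, produced by the sync step:
    // either reuse a block of the old file or fill in downloaded bytes.
    struct PlanEntry {
        bool reuse = false;   // true: copy an old block into place
        size_t oldIndex = 0;  // valid when reuse == true
        std::string data;     // downloaded bytes when reuse == false
    };

    // Steps 3 and 4: rewrite the file in place, block by block.
    // Assumes the plan is cycle-free and ordered so each old block is read
    // before its location is overwritten; a real implementation must sort
    // the moves (or buffer blocks in memory/spare disk) to ensure this.
    bool patch_in_place(const char* path, const std::vector<PlanEntry>& plan) {
        FILE* f = std::fopen(path, "r+b");
        if (!f) return false;
        std::vector<char> buf(kBlock, 0);
        for (size_t newIndex = 0; newIndex < plan.size(); ++newIndex) {
            const PlanEntry& e = plan[newIndex];
            if (e.reuse) {   // retained block: read it from its old offset
                std::fseek(f, static_cast<long>(e.oldIndex * kBlock), SEEK_SET);
                std::fread(buf.data(), 1, kBlock, f);
            } else {         // missing block: use the downloaded bytes
                std::memcpy(buf.data(), e.data.data(),
                            std::min(e.data.size(), kBlock));
            }
            std::fseek(f, static_cast<long>(newIndex * kBlock), SEEK_SET);
            std::fwrite(buf.data(), 1, kBlock, f);  // write to its new place
        }
        std::fclose(f);
        return true;
    }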