How do in-memory databases store data larger than RAM in GemFireXD? - in-memory-database

Suppose I am using a cluster of 4 nodes, each with 4 GB of RAM, so the total RAM is 16 GB, and I have to store 20 GB of data in a table.
How will an in-memory database accommodate this data? I have read that data is swapped between RAM and disk, but wouldn't that make data access slow? Please explain.

GemFire or GemFireXD evicts data to disk if it comes under memory pressure while accommodating more data.
This may have some performance implications. However, the user can control how and when eviction takes place; all of the eviction algorithms are based on a least-recently-used (LRU) policy.
Also, when a row is evicted, its primary key value remains in memory while the remaining column data is written to disk. This makes fetching the row back from disk faster.
You can go through the following link to understand eviction in GemFireXD:
http://gemfirexd.docs.pivotal.io/1.3.0/userguide/developers_guide/topics/cache/cache.html
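For illustration, here is a minimal sketch of the kind of eviction DDL those docs describe, kept in a Python string so you can submit it through whatever SQL client you use. The table name, column names and the LRUMEMSIZE threshold are invented; verify the exact clause syntax against the linked eviction docs for your GemFireXD version.

```python
# Hedged sketch: overflow least-recently-used rows of this table to disk once
# it uses roughly 3 GB of heap per server. Names and sizes are invented;
# check the clause syntax against the GemFireXD eviction docs linked above.
EVICTION_DDL = """
CREATE TABLE big_table (
    id      BIGINT NOT NULL PRIMARY KEY,
    payload VARCHAR(2000)
)
PARTITION BY PRIMARY KEY
EVICTION BY LRUMEMSIZE 3072 EVICTACTION OVERFLOW;
"""

print(EVICTION_DDL)  # submit via the gfxd shell or any JDBC/ODBC client
```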

HANA offers the possibility to unload data from main memory. Since that data is then stored on the hard disk, queries accessing it will of course run more slowly. Have a look at the hot/warm/cold data concept if you haven't come across it yet.
This article gives you additional information about the topic: http://scn.sap.com/community/bw-hana/blog/2014/02/14/sap-bw-on-hana-data-classification-hotwarmcold

Though the question only targeted SQLite and HANA, I want to share some insights on Oracle Database In-Memory. It manages to load huge tables into the in-memory area by using various compression algorithms. Data populated into the IM column store is compressed with a new set of compression algorithms that not only save space but also improve query performance. For example, a table 10 GB in size compressed at the CAPACITY HIGH level may shrink to about 3 GB. This allows a table whose size is greater than RAM to be stored in compressed form in the in-memory area.
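As a hedged sketch of what that looks like in practice (the table name is invented, and the 10 GB to 3 GB figure is just the illustrative ratio from above, not a guaranteed outcome):

```python
# Hedged sketch: enable the IM column store for one (invented) table at the
# highest capacity-oriented compression level, then estimate its footprint.
INMEMORY_DDL = "ALTER TABLE sales INMEMORY MEMCOMPRESS FOR CAPACITY HIGH"

table_size_gb = 10.0        # illustrative on-disk size from the paragraph above
assumed_ratio = 3.0 / 10.0  # illustrative CAPACITY HIGH compression ratio

print(INMEMORY_DDL)
print(f"Estimated IM column store footprint: {table_size_gb * assumed_ratio:.1f} GB")
```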

The OP specifically asked about a cluster, so that rules out SQLite (at least out of the box). You need a DBMS that can:
- treat the 4 x 4 GB of memory as 16 GB of "storage" (in other words, distribute the data across the nodes of the cluster but treat it as a whole)
- compress the data to squeeze the 20 GB of raw data into the available 16 GB
eXtremeDB is one such solution. So is Oracle Database In-Memory (with RAC). I'm sure there are others.

If you configure your tables accordingly, GemFireXD can use off-heap memory to store a larger amount of data in memory, pushing the point at which data must be evicted to disk a bit further out (and reads of evicted data are optimized for faster lookup, because the lookup keys stay in memory):
http://gemfirexd.docs.pivotal.io/1.3.1/userguide/data_management/off-heap-guidelines.html

Related

What is the max size of Django cache and is there any chance to extend that?

I'm working on a project where I have to deal with big data (a large number of rows).
To make the app fast I'm using Redis behind the Django cache: I select all the data from table A and save it to the cache as a JSON array, and later I select, update and delete from the cache. I also have other tables B, C, D, etc.
Each table could hold more than 3 million rows in the future, probably many more.
I was wondering whether the Django cache can handle that.
If not, what is the maximum size or number of rows that the Django cache can store?
Is there a way to extend that, and if not, what are the other solutions to this problem?
I don't want to always select data from the database, because speed matters here.
It totally depends on how much RAM you have. There is no hard limit in Redis; the data size is limited to the amount of RAM you have allowed Redis to use.
Don't worry, 3 million rows or more is small data!
But if you have very big data, you can try one of the following approaches:
- Store the actual content/data on disk and keep only an index of the data in Redis. You can use secondary indexes in Redis.
- Encode/compress your data using any lossless compression technique and then store it in Redis (see the sketch below). Redis will not compress data for you.
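A minimal sketch of the second approach, assuming the Django cache is already backed by Redis (for example via django-redis); the cache key and the sample rows are placeholders for your real table A data:

```python
import json
import zlib

from django.core.cache import cache  # assumes a Redis-backed cache is configured

def cache_rows(key, rows, timeout=None):
    # Serialize to JSON and compress before the payload goes to Redis;
    # Redis itself stores the value as-is.
    payload = zlib.compress(json.dumps(rows).encode("utf-8"))
    cache.set(key, payload, timeout)

def get_rows(key):
    # Fetch, decompress and deserialize rows stored by cache_rows().
    payload = cache.get(key)
    if payload is None:
        return None
    return json.loads(zlib.decompress(payload).decode("utf-8"))

# Placeholder usage: "table_a" and the row dicts stand in for your real data.
cache_rows("table_a", [{"id": 1, "name": "example"}])
print(get_rows("table_a"))
```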

Redshift table sizes & flavours of

Confused by the term 'table size' in Redshift.
We have:
svv_table_info.size
"Size of table in 1MB blocks"
svv_table_info.pct_used
"Percent of available space used"
... so I assume that a lot of the 'size' is empty space due to sort keys etc.
Then we have this:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
... which uses the term 'minimum' table size.
But nowhere can I find an explanation of what these mean in the real world. Is this a theoretical minimum if optimally configured?
Ultimately I need to find out the basic size of the original, tangible data without any overheads.
And then, yes, how much disk space it actually costs to store it in Redshift.
So if I took 1 TB out of our on-prem database and shoved it into Redshift, I'd be looking to see something like 1 TB (data) and 1.2 TB (data + Redshift overheads).
Hope someone can help clarify 🤞
Redshift stores data in 1 MB blocks, and each block is associated with a slice and a column. So if I have 2 slices in my cluster and a table with 4 columns (plus the 3 system columns, making 7) with EVEN distribution and containing at least 2 rows, then my table will minimally take up 2 x 7 x 1 MB of space (14 MB on disk). This is all that article is saying.
Now if I insert 2 additional rows into this table, Redshift will make new blocks for the new data, so my 4 rows now take up 28 MB of space. However, if I vacuum the table, the wasted space is reclaimed and the table size comes back down to 14 MB. (Yes, this is a bit of an oversimplification, but it gets the concepts across.)
As a rule of thumb, a single 1 MB block will typically hold between 100,000 and 2,000,000 rows of compressed data (assuming the data isn't monster varchars). So for the table above I can keep adding rows (and vacuuming) without increasing the table size on disk until I have a few hundred thousand rows per slice. Redshift is very efficient at storing large chunks of data, but very inefficient at storing small ones.
What Redshift knows about your data size is how many blocks it takes on disk (across all the nodes, slices, and columns). How big your data would be if it were stored differently (not in blocks, compressed or uncompressed) is not something Redshift tracks. As John noted, for big tables Redshift stores data more efficiently than most other databases (when compression is used).
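That minimum-size rule is easy to turn into arithmetic; here is a small sketch of the calculation behind the 2-slice, 4-column example above (the 3 hidden system columns are the assumption stated in the answer):

```python
def min_table_size_mb(slices, user_columns, system_columns=3, block_mb=1):
    # Smallest footprint for a populated table: one 1 MB block per column per slice.
    return slices * (user_columns + system_columns) * block_mb

print(min_table_size_mb(slices=2, user_columns=4))    # 14, as in the example above
print(min_table_size_mb(slices=32, user_columns=20))  # 736: small tables on big clusters look inflated
```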
You cannot translate an existing database's size into the size of a table in Redshift. This is because:
- Columns are stored separately
- The minimum block size is 1 MB
- Data in Redshift is compressed, so it can take considerably less space depending on the type of data and the compression encoding chosen
Given compression, your data is likely to be smaller in Redshift than in the original (uncompressed) data source. However, you can't really calculate that in advance unless you have transferred similar data in the past and can apply a similar ratio.
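If you have already loaded a representative sample, the "similar ratio" advice is just a division; the numbers below are made-up placeholders you would replace with svv_table_info.size for the sample and the size of the same rows in your source system:

```python
# Placeholder figures: replace with svv_table_info.size (reported in 1 MB blocks)
# for a sample table you loaded, and the size of the same rows at the source.
redshift_sample_mb = 180
source_sample_mb = 512

ratio = redshift_sample_mb / source_sample_mb
source_total_gb = 1024  # e.g. the 1 TB from the question
projected_gb = source_total_gb * ratio

print(f"Observed ratio: {ratio:.2f}")
print(f"Projected Redshift footprint for {source_total_gb} GB of similar data: ~{projected_gb:.0f} GB")
```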

VoltDB is exhausting the RAM while loading the data

I am trying to load database tables into VoltDB using its csvloader utility. When I try to load one table about 5 GB in size, VoltDB eats RAM so fast that free RAM drops from 55 GB to 200 MB, and then the VoltDB process gets killed by the system.
What can be the reason for this, and what are the recommended settings for VoltDB to avoid it?
Is the table you are loading partitioned? That's the first thing to check, because if you have the default sitesperhost=8 on a single server and the table is not partitioned, there will be a complete copy of the table in each of the 8 partitions. If the table is partitioned, the data is distributed among the partitions based on the hash of the partitioning key column's values.
If it's partitioned and you still can't load all of the data, the next thing to look at is the schema. There are formulas in the Planning Guide that describe the memory usage for given datatypes and for indexes. The VMC interface also has a sizing worksheet that gives you the minimums and maximums based on the schema. You could also post the definition of the table you are trying to load, along with any indexes you have defined on it, and we can explain more about the bytes it would use per row.
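As a back-of-the-envelope sketch of why the replicated (non-partitioned) case hurts, ignoring the per-row and index overhead that the Planning Guide formulas cover:

```python
# Rough illustration only: a replicated table is copied into every partition,
# so with the default sitesperhost=8 on one server a 5 GB load needs roughly
# 8x that much memory before any per-row or index overhead.
csv_gb = 5
sites_per_host = 8

replicated_gb = csv_gb * sites_per_host  # one full copy per partition
partitioned_gb = csv_gb                  # rows spread across the partitions

print(f"Replicated table:  ~{replicated_gb} GB of table data in RAM")
print(f"Partitioned table: ~{partitioned_gb} GB of table data spread across partitions")
```

If the table turns out to be replicated, declaring a partitioning column in the schema (for example PARTITION TABLE my_table ON COLUMN id;, with placeholder table and column names) is usually the first thing to try.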

CouchDB disk size 10x aggregate document size

I have a CouchDB database with ~16,000 similar documents of about 500 bytes each. The stats for the db report (commas added):
"disk_size":73,134,193,"data_size":7,369,551
Why is the disk size 10x the data_size? I would expect, if anything, for the disk size to be smaller as I am using the default (snappy) compression and this data should be quite compressible.
I have no views on this DB, and each document has a single revision. Compaction has very little effect.
Here's the full output from hitting the DB URI:
{"db_name":"xxxx","doc_count":17193,"doc_del_count":2,"update_seq":17197,"purge_seq":0,"compact_running":false,"disk_size":78119025,"data_size":7871518,"instance_start_time":"1429132835572299","disk_format_version":6,"committed_update_seq":17197}
I think you are getting correct results. CouchDB stores documents in chunks of 4 KB each (I can't find a reference at the moment, but you can test it by storing an empty document); that is, the minimum size of a document on disk is 4 KB.
This means that even if a document only holds 500 bytes of data, CouchDB will still store it in 4 KB chunks. Doing a rough calculation:
17193 * 4 * 1024 + (2 * 4 * 1024) = 70,430,720
That is in the range of the reported 78,119,025, still a little less, but the difference could be due to the way files are stored on disk.
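The same rough calculation written out, using the doc_count and doc_del_count from the stats above and the (assumed) 4 KB-per-document floor:

```python
# Rough check of disk_size against an assumed 4 KB-per-document minimum.
doc_count = 17193
doc_del_count = 2
chunk_bytes = 4 * 1024

estimate = (doc_count + doc_del_count) * chunk_bytes
reported_disk_size = 78119025

print(f"Estimated floor: {estimate:,} bytes")   # 70,430,720
print(f"Reported size:   {reported_disk_size:,} bytes")
print(f"Difference:      {reported_disk_size - estimate:,} bytes")
```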

How does GemFireXD store a table data file larger than its in-memory capacity?

I have 4 GB of memory, and the data file I am going to load into GemFireXD is 8 GB. How does the in-memory store handle the remaining 4 GB of data? I read about the EVICTION clause but didn't find any clarification.
Is the data copied to disk while it is being loaded, or does copying to disk only start after the 4 GB is filled?
Please help with this.
Thank you.
If you use the EVICTION clause without the PERSISTENT clause, data starts being written to disk once you reach the eviction threshold: the least recently used rows are written to disk and dropped from memory.
If you have a PERSISTENT table, the data is already on disk when you reach the eviction threshold. At that point, the least recently used rows are simply dropped from memory.
Note that there is still a per-row overhead in memory even when a row is evicted.
Doc reference for details:
http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#overflow/configuring_data_eviction.html
http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#caching_database/eviction_limitations.html
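To make the two cases concrete, here is a hedged sketch of the DDL difference; the table and column names are invented, and the exact spelling and ordering of the clauses should be checked against the eviction docs linked above:

```python
# Hedged sketch only: invented names, clause syntax to be verified against the docs.

# 1) Eviction without persistence: rows are written to disk for the first time
#    at the eviction threshold, then dropped from memory.
EVICTION_ONLY = """
CREATE TABLE events (
    id      BIGINT NOT NULL PRIMARY KEY,
    payload VARCHAR(4000)
)
EVICTION BY LRUHEAPPERCENT EVICTACTION OVERFLOW;
"""

# 2) Persistent table with eviction: the data is already on disk, so reaching
#    the threshold only drops the least recently used rows from memory.
PERSISTENT_WITH_EVICTION = """
CREATE TABLE events (
    id      BIGINT NOT NULL PRIMARY KEY,
    payload VARCHAR(4000)
)
PERSISTENT
EVICTION BY LRUHEAPPERCENT EVICTACTION OVERFLOW;
"""
```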