How does GemFireXD store table data larger than its in-memory capacity?

I have 4 GB of memory, and the data file I am going to load into GemFireXD is 8 GB. How does the in-memory store handle the remaining 4 GB of data? I read about the EVICTION clause, but it did not clear this up for me.
Is the data copied to disk while it loads, or does copying to disk start only after the 4 GB fills up?
Any help on this is appreciated.
Thank you.

If you use the EVICTION clause without using the PERSISTENT clause, the data will start being written to disk once you reach the eviction threshold. The least recently used rows will be written to disk and dropped from memory.
If you have a PERSISTENT table, the data is already on disk when you reach your eviction threshold. At that point, the least recently used rows are dropped from memory.
Note that there is still a per row overhead in memory even if the row is evicted.
Doc reference for details:
http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#overflow/configuring_data_eviction.html
http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#caching_database/eviction_limitations.html

Related

Redshift table sizes & flavours of

Confused by the term 'table size' in Redshift.
We have :
svv_table_info.size
"Size of table in 1MB blocks"
svv_table_info.pct_used
"Percent of available space used"
... so I assume that a lot of the 'size' is empty space due to sort keys etc
Then we have this..
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
.. which uses the term 'minimum' table size.
But nowhere can I find an explanation of what that means in the real world. Is this a theoretical minimum if optimally configured?
Ultimately I need to find out the basic size of original tangible data without any overheads.
Then yes, how much disc space is it actually costing to store it in Redshift.
So if I took 1TB out of our on-prem database and shoved it into Redshift, I'd be looking to see something like 1TB (data) & 1.2TB (data + Redshift overheads).
Hope someone can help clarify 🤞

Redshift stores data in 1MB blocks and blocks are associated with a slice and a column. So if I have 2 slices in my cluster and a table with 4 columns (plus the 3 system columns to make 7) distributed as EVEN containing at least 2 rows, then my table will minimally take up 2 X 7 X 1MB of space (14MB on disk). This is all that article is saying.
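The minimum-size arithmetic above can be written as a small sketch. The function name and parameters are hypothetical, and the model assumes only what the answer states: one 1 MB block per column per slice, with 3 system columns added to the user columns.

```python
def min_table_size_mb(slices: int, user_columns: int,
                      system_columns: int = 3, block_mb: int = 1) -> int:
    """Smallest footprint once every slice holds at least one row.

    Simplified model from the answer: each column (user plus system)
    occupies at least one block on each slice that holds data.
    """
    return slices * (user_columns + system_columns) * block_mb


# The example from the answer: 2 slices, 4 user columns.
print(min_table_size_mb(slices=2, user_columns=4))  # 14
```

The same function shows why small tables look expensive on large clusters: the minimum grows with the slice count, not with the row count.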
Now if I insert 2 additional rows into this table, Redshift will make new blocks for this data. So now my 4 rows of data take up 28MB of space. However, if I vacuum the table, the wasted space will be reclaimed and the table size will come back down to 14MB. (Yes, this is a bit of an oversimplification, but I am trying to get the concepts across.)
As a rule of thumb a single 1MB block will typically hold between 100,000 rows and 2,000,000 rows of compressed data. (yes this depends on the data not being monster varchars) So for our table above I can keep adding rows (and vacuuming) without increasing the table size on disk until I get a few hundred thousand rows (per slice) in the table. Redshift is very efficient at storing large chunks of data but very inefficient at storing small ones.
What Redshift knows about your data size is how many blocks it takes on disk (across all the nodes, slices, and columns). How big your data would be if it were stored differently (not in blocks, compressed or uncompressed) is not something that is tracked. As John noted, for big tables, Redshift stores data more efficiently than most other databases (when compression is used).

You cannot translate from an existing database size to the size of a table in Redshift. This is because:
Columns are stored separately
Minimum block size is 1MB
Data in Redshift is compressed, so it can take considerably less space depending on the type of data and the compression type chosen
Given compression, your data is likely to be smaller in Redshift than an original (uncompressed) data source. However, you can't really calculate that in advance unless you have transferred similar data in the past and apply a similar ratio.

why AWS file size is different between Redshift and S3?

I'm UNLOADing tables from Redshift to S3 for backup. So I am checking to make sure the files are complete if we need them again.
I just did UNLOAD on a table that has size = 1,056 according to:
select "table", size, tbl_rows
FROM svv_table_info;
According to the documentation, the size is "in 1 MB data blocks", so this table is using 1,056 MB. But after copying to S3, the file size is 154 MB (viewing in AWS console).
I copied back to Redshift and all the rows are there, so this has to do with "1 MB data blocks". This is related to how it's saved in the file system, yes?
Can someone please explain? Thank you.

So you're asking why the SVV_TABLE_INFO view claims that the table consumes 1 GB, but when you dump it to disk the result is only 154 MB?
There are two primary reasons. The first is that you're actively updating the table but not vacuuming it. When a row is updated or deleted, Redshift actually appends a new row (yes, stored as columns) and tombstones the old row. To reclaim this space, you have to regularly vacuum the table. While Redshift will do some vacuuming in the background, this may not be enough, or it may not have happened at the time you're looking.
The second reason is that there's overhead required to store table data. Each column in a table is stored as a list of 1 MB blocks, one block per slice (and multiple slices per node). Depending on the size of your cluster and the column data type, this may lead to a lot of wasted space.
For example, if you're storing 32-bit integers, one 1MB block can store roughly 262,000 such integers uncompressed, requiring a total of 4 blocks to store 1,000,000 values (which is probably close to the number of rows in your table). But if you have a 4-node cluster with 2 slices per node (i.e., a dc2.large), then you'll actually require 8 blocks, because the column will be partitioned across all slices.
You can see the number of blocks that each column uses in STV_BLOCKLIST.
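As a rough illustration of the block arithmetic in this answer, the sketch below counts blocks for one column. It is a simplified model, not how Redshift itself computes anything: it assumes uncompressed fixed-width values, rows spread evenly across slices, and a block of exactly 1,048,576 bytes. The function name is hypothetical.

```python
import math

BLOCK_BYTES = 1024 * 1024  # assumption: one block is exactly 1 MiB


def blocks_for_column(rows: int, bytes_per_value: int, slices: int) -> int:
    """Blocks used by one column when rows are spread evenly over slices."""
    rows_per_slice = math.ceil(rows / slices)
    bytes_per_slice = rows_per_slice * bytes_per_value
    # Every slice that holds data needs at least one block for the column.
    blocks_per_slice = max(1, math.ceil(bytes_per_slice / BLOCK_BYTES))
    return slices * blocks_per_slice


# 1,000,000 32-bit integers on a single slice versus 8 slices.
print(blocks_for_column(1_000_000, 4, slices=1))  # 4
print(blocks_for_column(1_000_000, 4, slices=8))  # 8
```

In the 8-slice case each slice holds only about 500 KB of values, so half of every block is unused, which is the wasted space the answer describes.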

VoltDB is exhausting the RAM while loading the data

I am trying to load database tables into VoltDB using VoltDB's csvloader utility. When I try to load one table of size 5GB, VoltDB consumes RAM so quickly that free RAM drops from 55 GB to 200 MB, and then the VoltDB process gets killed by the system.
What could be the reason for this, and what are the recommended settings for VoltDB to avoid it?

Is the table you are loading partitioned? That's the first thing to check, because if you have the default sitesperhost=8 on a single server, and the table is not partitioned, there will be a complete copy of the table in each of the 8 partitions. If the table is partitioned, the data is distributed among the partitions based on the hashing assignment of the values of the partitioning key column.
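To see why partitioning matters here, the following sketch applies the answer's description directly: a table that is not partitioned is copied once per site, while a partitioned table is split among the sites. The function is hypothetical and ignores index and per-row overhead, so real usage would be higher.

```python
def table_memory_gb(data_gb: float, sites_per_host: int, partitioned: bool) -> float:
    """Raw table data held on one host, per the answer's description.

    Assumption: a non-partitioned table keeps a full copy in every site;
    a partitioned table stores each row exactly once.
    """
    return data_gb if partitioned else data_gb * sites_per_host


# The question's 5 GB table with the default sitesperhost=8.
print(table_memory_gb(5, sites_per_host=8, partitioned=False))  # 40
print(table_memory_gb(5, sites_per_host=8, partitioned=True))   # 5
```

Under this model the unpartitioned load needs about 40 GB for the raw data alone, which is already most of the 55 GB reported in the question.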
If it's partitioned and you still can't load all of the data, the next thing to look at would be the schema. There are formulas in the Planning Guide that describe the memory usage for given datatypes and for indexes. The VMC interface also has a sizing worksheet that gives you the mins and maxes based on the schema. You could also post the definition of the table you are trying to load, along with any indexes you have defined on it, and we can explain more about the bytes it would use per row.

CouchDB disk size 10x aggregate document size

I have a couchdb with ~16,000 similar documents of about 500 bytes each. The stats for the db report (commas added):
"disk_size":73,134,193,"data_size":7,369,551
Why is the disk size 10x the data_size? I would expect, if anything, for the disk size to be smaller as I am using the default (snappy) compression and this data should be quite compressible.
I have no views on this DB, and each document has a single revision. Compaction has very little effect.
Here's the full output from hitting the DB URI:
{"db_name":"xxxx","doc_count":17193,"doc_del_count":2,"update_seq":17197,"purge_seq":0,"compact_running":false,"disk_size":78119025,"data_size":7871518,"instance_start_time":"1429132835572299","disk_format_version":6,"committed_update_seq":17197}

I think you are getting correct results. CouchDB stores documents in chunks of 4 KB each (I can't find a reference at the moment, but you can test it by storing an empty document). That is, the minimum size of a document is 4 KB.
This means that even if you store 500 bytes per document, CouchDB will save it in chunks of 4 KB each. A rough calculation:
17193 * 4 * 1024 + (2 * 4 * 1024) = 70430720
That is in the range of 78119025. It is still a little less, but that could be due to the way files are stored on disk.
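The rough calculation above can be reproduced as follows. The 4 KB chunk size is the answer's own unreferenced claim, so it is treated as an assumption here, and the function name is hypothetical. The document counts and the reported size come from the database output in the question.

```python
CHUNK_BYTES = 4 * 1024  # assumption taken from the answer, not verified


def estimated_disk_bytes(doc_count: int, deleted_count: int,
                         chunk_bytes: int = CHUNK_BYTES) -> int:
    """One chunk per live document plus one per deleted document."""
    return (doc_count + deleted_count) * chunk_bytes


estimate = estimated_disk_bytes(doc_count=17193, deleted_count=2)
reported = 78119025  # "disk_size" from the question's output

print(estimate)  # 70430720
print(f"estimate covers {estimate / reported:.0%} of the reported disk_size")
```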

How do in-memory databases store data larger than RAM in GemFireXD?

Suppose I am using a cluster of 4 nodes, each with 4GB of RAM, so the total RAM is 16 GB, and I have to store 20 GB of data in a table.
How will an in-memory database accommodate this data? I read somewhere that the data is swapped between RAM and disk, but wouldn't that make data access slow? Please explain.

GemFire or GemFireXD evicts data to disk if it comes under memory pressure while accommodating more data.
This may have some performance implications. However, the user can control how and when eviction takes place. All of the eviction policies use a Least Recently Used (LRU) algorithm to choose which data to evict.
Also, when a row is evicted, the primary key value remains in memory while the remaining column data is evicted. This makes fetching the row from disk faster.
You can go through the following links to understand about evictions in GemFireXD:
http://gemfirexd.docs.pivotal.io/1.3.0/userguide/developers_guide/topics/cache/cache.html
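The eviction behaviour described in this answer can be illustrated with a toy model. This is not GemFireXD code or its API: the class and its names are hypothetical, the "disk" is just a second dictionary, and reads of evicted rows are served from disk without being brought back into memory, which is a simplification.

```python
from collections import OrderedDict


class OverflowTable:
    """Toy LRU overflow: keys stay in memory, evicted row data goes to 'disk'."""

    def __init__(self, max_rows_in_memory: int):
        self.max_rows = max_rows_in_memory
        self.memory = OrderedDict()  # key -> row, least recently used first
        self.disk = {}               # stand-in for overflow files
        self.keys = set()            # every key stays in memory

    def put(self, key, row):
        self.disk.pop(key, None)     # a rewritten row lives in memory again
        self.memory[key] = row
        self.memory.move_to_end(key)
        self.keys.add(key)
        while len(self.memory) > self.max_rows:
            old_key, old_row = self.memory.popitem(last=False)
            self.disk[old_key] = old_row  # evict the least recently used row

    def get(self, key):
        if key not in self.keys:     # answered from memory, no disk access
            return None
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as recently used
            return self.memory[key]
        return self.disk[key]        # slower path: row data was evicted


table = OverflowTable(max_rows_in_memory=2)
for key, row in [(1, "a"), (2, "b"), (3, "c")]:
    table.put(key, row)

print(sorted(table.memory))  # [2, 3]
print(sorted(table.disk))    # [1]
print(table.get(1))          # a
```

Keeping the key set in memory means a lookup for a missing key never touches disk, and a lookup for an evicted key goes straight to the stored row. It also shows why evicted rows still cost some memory.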

HANA offers the possibility to unload data from main memory. Since the data is then stored on the hard disk, queries accessing this data will of course run more slowly. Have a look at the hot/warm/cold data concept if you haven't heard about it.
This article gives you additional information about this topic: http://scn.sap.com/community/bw-hana/blog/2014/02/14/sap-bw-on-hana-data-classification-hotwarmcold

Though the question only targeted SQLite and HANA, I wanted to share some insights on Oracle's Database In-Memory. It manages to load huge tables into the in-memory area by using various compression algorithms. Data populated into the IM column store is compressed using a new set of compression algorithms that not only help save space but also improve query performance. For example, a table 10GB in size shrinks to 3GB when compressed with the "capacity high" option. This allows a table whose size is greater than RAM to be stored in a compressed format in the in-memory area.

The OP specifically asked about a cluster, so that rules out SQLite (at least out of the box). You need a DBMS that can:
treat the 4 X 4GB of memory as 16GB of "storage" (IOW distribute the data across the nodes of the cluster, but treat it as a whole)
compress the data to squeeze the 20GB of raw data into the available 16GB
eXtremeDB is one such solution. So is Oracle's Database In-Memory (with RAC). I'm sure there are others.

If you configure your tables to do so, GemFireXD can use off-heap memory to store a larger amount of data in memory, which pushes back the point at which data must be evicted to disk (although reads of evicted data are optimized for faster lookup because the lookup keys are in memory).
http://gemfirexd.docs.pivotal.io/1.3.1/userguide/data_management/off-heap-guidelines.html
