I am trying to load database tables into a VoltDB database using VoltDB's csvloader utility. When I load one table of about 5 GB, VoltDB eats RAM so fast that free memory drops from 55 GB to roughly 200 MB, and then the VoltDB process gets killed by the system.
What could be the reason for this, and what are the recommended settings for VoltDB to avoid it?
Is the table you are loading partitioned? That's the first thing to check, because if you have the default sitesperhost=8 on a single server, and the table is not partitioned, there will be a complete copy of the table in each of the 8 partitions. If the table is partitioned, the data is distributed among the partitions based on the hashing assignment of the values of the partitioning key column.
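For example, declaring a partitioning column is a one-line DDL statement (the table and column names below are just placeholders for your own schema):
-- VoltDB DDL: distribute rows across partitions by hashing this column,
-- instead of keeping a full copy of the table in every partition.
PARTITION TABLE my_big_table ON COLUMN customer_id;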
If it's partitioned and you still can't load all of the data, the next thing to look at would be the schema. There are formulas in the Planning Guide that describe the memory usage for given datatypes and for indexes. The VoltDB Management Center (VMC) interface also has a sizing worksheet that gives you the mins and maxes based on the schema. You could also post the definition of the table you are trying to load, along with any indexes you have defined on it, and we can explain more about the bytes it would use per row.
Related
Confused by the term 'table size' in Redshift.
We have:
svv_table_info.size
"Size of table in 1MB blocks"
svv_table_info.pct_used
"Percent of available space used"
... so I assume that a lot of the 'size' is empty space due to sort keys, etc.
Then we have this:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
... which uses the term 'minimum' table size.
But nowhere can I find an explanation of what these mean in the real world. Is this a theoretical minimum if optimally configured?
Ultimately I need to find out the basic size of the original, tangible data without any overheads.
Then, yes, how much disk space it actually costs to store it in Redshift.
So if I took 1TB out of our on-prem database and shoved it into Redshift, I'd be looking to see something like 1TB (data) & 1.2TB (data + Redshift overheads).
Hope someone can help clarify 🤞
Redshift stores data in 1MB blocks, and blocks are associated with a slice and a column. So if I have 2 slices in my cluster and a table with 4 columns (plus the 3 system columns to make 7), distributed as EVEN and containing at least 2 rows, then my table will minimally take up 2 x 7 x 1MB of space (14MB on disk). This is all that article is saying.
Now if I insert 2 additional rows into this table, Redshift will make new blocks for this data. So now my 4 rows of data take up 28MB of space. However, if I vacuum the table the wasted space will be reclaimed and the table size will come back down to 14MB. (Yes, this is a bit of an oversimplification, but I'm trying to get the concepts across.)
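As a sketch (the table name is a placeholder), reclaiming that space looks like this:
-- Reclaim space from updated/deleted rows and re-sort the table.
VACUUM FULL my_table;
-- Refresh the planner statistics afterwards.
ANALYZE my_table;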
As a rule of thumb a single 1MB block will typically hold between 100,000 rows and 2,000,000 rows of compressed data. (yes this depends on the data not being monster varchars) So for our table above I can keep adding rows (and vacuuming) without increasing the table size on disk until I get a few hundred thousand rows (per slice) in the table. Redshift is very efficient at storing large chunks of data but very inefficient at storing small ones.
What Redshift knows about your data size is how many blocks it takes on disk (across all the nodes, slices, and columns). How big your data would be if it was stored differently (not in blocks, compressed or uncompressed) is not data that is tracked. As John noted, for big tables, Redshift stores data more efficiently than most other databases (when compression is used).
You cannot translate from an existing database size to the size of a table in Redshift. This is because:
Columns are stored separately
Minimum block size is 1MB
Data in Redshift is compressed, so it can take considerably less space depending on the type of data and the compression type chosen
Given compression, your data is likely to be smaller in Redshift than an original (uncompressed) data source. However, you can't really calculate that in advance unless you have transferred similar data in the past and apply a similar ratio.
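If you can load a representative sample first, Redshift will estimate per-column compression for you (the table name is a placeholder):
-- Samples the table and reports a suggested encoding and the estimated
-- space reduction for each column.
ANALYZE COMPRESSION my_table;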
I'm UNLOADing tables from Redshift to S3 for backup, so I am checking to make sure the files are complete in case we need them again.
I just did UNLOAD on a table that has size = 1,056 according to:
select "table", size, tbl_rows
FROM svv_table_info;
According to the documentation, the size is "in 1 MB data blocks", so this table is using 1,056 MB. But after copying to S3, the file size is 154 MB (viewing in AWS console).
I copied back to Redshift and all the rows are there, so this has to do with "1 MB data blocks". This is related to how it's saved in the file system, yes?
Can someone please explain? Thank you.
So you're asking why the SVV_TABLE_INFO view claims that the table consumes 1 GB, but when you dump it to disk the result is only 154 MB?
There are two primary reasons. The first is that you're actively updating the table but not vacuuming it. When a row is updated or deleted, Redshift actually appends a new row (yes, stored as columns) and tombstones the old row. To reclaim this space, you have to regularly vacuum the table. While Redshift will do some vacuuming in the background, this may not be enough, or it may not have happened at the time you're looking.
The second reason is that there's overhead required to store table data. Each column in a table is stored as a list of 1 MB blocks, one block per slice (and multiple slices per node). Depending on the size of your cluster and the column data type, this may lead to a lot of wasted space.
For example, if you're storing 32-bit integers, one 1MB block can store roughly 256,000 such integers, requiring a total of 4 blocks to store 1,000,000 values (which is probably close to the number of rows in your table). But if you have a 4-node cluster with 2 slices per node (i.e., dc2.large nodes), then you'll actually require 8 blocks, because the column will be partitioned across all slices.
You can see the number of blocks that each column uses in STV_BLOCKLIST.
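For example, something along these lines (the table name is a placeholder) counts the 1 MB blocks each column occupies:
-- Blocks per column for one table, summed across all slices.
SELECT b.col, COUNT(*) AS blocks
FROM stv_blocklist b
JOIN svv_table_info t ON b.tbl = t.table_id
WHERE t."table" = 'some_table'
GROUP BY b.col
ORDER BY b.col;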
I wonder why unloading from a big table (>100 bln rows), filtering on a column which is NOT a sort key or part of the sort key, is immensely faster for newly added data. How does Redshift know when it can stop the sequential scan in the second scenario?
Time the query spent executing: 39m 37.02s
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\\'2017-01-15\\' AND \\'2017-01-16\\'') TO ...
vs.
Time the query spent executing: 23.01s
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\\'2017-06-24\\' AND \\'2017-06-25\\'') TO ...
Thanks!
Amazon Redshift uses zone maps to identify the minimum and maximum value stored in each 1MB block on disk. Each block only stores data related to a single column (eg daytime).
If the SORTKEY is not set to daytime, then the data is unsorted and any particular date could appear in many different blocks. If SORTKEY is used, then a particular date will only appear in a minimum number of blocks.
Your second query possibly executes faster, even without a SORTKEY, because you are querying data that was probably added recently and is therefore all stored together in just a few blocks. The historical data might be spread in many blocks because a VACUUM probably reordered the data based upon the correct SORTKEY. In fact, if you did a VACUUM now, you might find that your second query becomes slower.
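If you want to see the zone maps yourself, a query roughly like this (the table name and column ordinal are placeholders, and the min/max values are stored in an internal encoded form) shows the per-block ranges Redshift uses to decide which blocks it can skip:
-- Per-block min/max (zone map) entries for one column of one table.
SELECT b.blocknum, b.minvalue, b.maxvalue, b.num_values
FROM stv_blocklist b
JOIN svv_table_info t ON b.tbl = t.table_id
WHERE t."table" = 'some_table'
  AND b.col = 0  -- ordinal position of the daytime column (placeholder)
ORDER BY b.blocknum;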
I have a use case where I continuously need to trickle feed data into dashDB, however I have been informed that this is not optimal for dashDB.
Why is this not optimal? Is there a workaround?
Columnar warehouses are great for reads, but if you insert a single row into an N column table then the system has to cut the row into pieces and do N separate writes to disk. This makes small inserts relatively inefficient and things can slow down as a result.
You may want to do an initial batch load of data. Currently the compression dictionary is built only for bulk loads, so if you start with a new table and populate it only using inserts then the data doesn't get compressed at all.
Try to structure the loading into microbatches with a 2-5 minute load cycle.
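A simple way to batch, sketched here with placeholder names, is to buffer incoming rows and write many of them per statement rather than issuing one INSERT per row:
-- One multi-row INSERT per microbatch instead of hundreds of single-row INSERTs.
INSERT INTO events (id, payload) VALUES
  (1, 'a'),
  (2, 'b'),
  (3, 'c');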
What is the use case here? Check whether dashDB Transactional can meet your need; it is tuned for OLTP and point-of-sale transactions, which is what you are trying to feed.
Suppose I am using a cluster of 4 nodes, each having 4 GB of RAM, so the total RAM is 16 GB, and I have to store 20 GB of data in a table.
How will an in-memory database accommodate this data? I read somewhere that the data is swapped between RAM and disk, but wouldn't that make data access slow? Please explain.
GemFire or GemFireXD evicts data to disk if it comes under memory pressure while accommodating more data.
That may have some performance implications. However, the user can control how and when eviction takes place. All of the eviction policies use a Least Recently Used (LRU) algorithm to choose which data to evict.
Also, when a row is evicted, the primary key value remains in memory while the remaining column data is evicted. This makes fetching the row from disk faster.
You can go through the following links to understand about evictions in GemFireXD:
http://gemfirexd.docs.pivotal.io/1.3.0/userguide/developers_guide/topics/cache/cache.html
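If I remember the GemFireXD DDL correctly, configuring eviction looks roughly like this (the table is a placeholder; the exact clause options are documented in the link above):
-- Overflow least-recently-used rows to disk once the JVM heap crosses
-- the configured eviction threshold.
CREATE TABLE orders_history (
  id BIGINT PRIMARY KEY,
  payload VARCHAR(200)
) EVICTION BY LRUHEAPPERCENT EVICTACTION OVERFLOW;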
HANA offers the possibility to unload data from main memory. Since the data is then stored on the hard disk, queries accessing it will of course run more slowly. Have a look at the hot/warm/cold data concept if you haven't heard about it.
This article gives you additional information about this topic: http://scn.sap.com/community/bw-hana/blog/2014/02/14/sap-bw-on-hana-data-classification-hotwarmcold
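In SQL the unloading can be done explicitly or via an unload priority; a rough sketch with placeholder names:
-- Explicitly drop this table's column data from main memory.
UNLOAD "MYSCHEMA"."SALES_HISTORY";
-- Or mark it so HANA unloads it before other tables under memory pressure
-- (unload priority ranges from 0 to 9; higher values are unloaded earlier).
ALTER TABLE "MYSCHEMA"."SALES_HISTORY" UNLOAD PRIORITY 9;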
Though the question only targeted SQLite and HANA, I wanted to share some insights on Oracle Database In-Memory. It can load huge tables into the in-memory area by using various compression algorithms. Data populated into the IM column store is compressed using a new set of compression algorithms that not only save space but also improve query performance. For example, a table 10 GB in size, compressed with CAPACITY HIGH, shrinks to about 3 GB. This allows a table whose size is greater than RAM to be stored in a compressed format in the in-memory area.
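The compression level is chosen per table (or partition) when enabling it for the IM column store, for example (placeholder table name):
-- Populate the table into the In-Memory column store using the compression
-- level that favors capacity (space savings) over scan speed.
ALTER TABLE sales_history INMEMORY MEMCOMPRESS FOR CAPACITY HIGH;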
The OP specifically asked about a cluster, so that rules out SQLite (at least out of the box). You need a DBMS that can:
treat the 4 X 4GB of memory as 16GB of "storage" (IOW distribute the data across the nodes of the cluster, but treat it as a whole)
compress the data to squeeze the 20GB of raw data into the available 16GB
eXtremeDB is one such solution. So is Oracle's Database In-Memory (with RAC). I'm sure there are others.
If you configure your tables accordingly, GemFireXD can use off-heap memory to store a larger amount of data in memory, pushing off the need to evict data to disk a bit further (and reads of evicted data are still optimized for faster lookup, because the lookup keys remain in memory).
http://gemfirexd.docs.pivotal.io/1.3.1/userguide/data_management/off-heap-guidelines.html
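A minimal sketch (placeholder table; the server members also need an off-heap memory size configured at startup):
-- Keep this table's row data in off-heap memory rather than on the JVM heap.
CREATE TABLE hot_table (
  id BIGINT PRIMARY KEY,
  payload VARCHAR(200)
) OFFHEAP;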