CouchDB disk size 10x aggregate document size - compression

I have a CouchDB database with ~16,000 similar documents of about 500 bytes each. The stats for the DB report (commas added):
"disk_size":73,134,193,"data_size":7,369,551
Why is the disk size 10x the data_size? I would expect, if anything, for the disk size to be smaller as I am using the default (snappy) compression and this data should be quite compressible.
I have no views on this DB, and each document has a single revision. Compaction has very little effect.
Here's the full output from hitting the DB URI:
{"db_name":"xxxx","doc_count":17193,"doc_del_count":2,"update_seq":17197,"purge_seq":0,"compact_running":false,"disk_size":78119025,"data_size":7871518,"instance_start_time":"1429132835572299","disk_format_version":6,"committed_update_seq":17197}

I think you are getting correct results. CouchDB stores documents in chunks of 4 KB each (I can't find a reference at the moment, but you can test it out by storing an empty document). That is, the minimum on-disk size of a document is 4 KB.
This means that even if you store only 500 bytes per document, CouchDB is going to save it in chunks of 4 KB each. So, doing a rough calculation:
17193*4*1024+(2*4*1024)= 70430720
That is in the range of the reported 78119025, still a little less, but the remaining difference could be due to the way the file is laid out on disk.
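As a rough sanity check, here is the same back-of-envelope estimate in Python, assuming the 4 KB-per-document figure above (the leftover gap would then come from how CouchDB lays the file out on disk):

```python
# Back-of-envelope estimate of the CouchDB file size, assuming ~4 KB of
# on-disk space per document (the figure from the answer above).
doc_count = 17193       # live documents, from the DB info
doc_del_count = 2       # deleted documents still tracked in the file

estimate = (doc_count + doc_del_count) * 4 * 1024
print(estimate)              # 70430720
print(78119025 - estimate)   # ~7.7 MB left over, attributable to file layout
```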

Related

What is the best performance I can get when querying DynamoDB for the maximum 1 MB?

I am using DynamoDB for storing data, and I see that 1 MB is the hard limit for a query to return. I have a case where a query fetches 1 MB of data from one partition, and I'd like to know the best performance I can get.
Based on the DynamoDB docs, one partition can have a maximum of 3000 RCU. If I send eventually consistent reads, it should support responding with 3000 * 8 KB = 24000 KB, roughly 23 MB per second.
If I send one query request to fetch 1 MB from one partition, does this mean it should respond in 1/23 of a second, about 43 milliseconds?
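For reference, the back-of-envelope arithmetic above works out as follows (a rough sketch, assuming one eventually consistent RCU covers two 4 KB reads, i.e. 8 KB):

```python
# Rough throughput/latency estimate for eventually consistent reads.
rcu_per_partition = 3000
bytes_per_rcu = 8 * 1024                           # eventually consistent read
throughput = rcu_per_partition * bytes_per_rcu     # 24,576,000 B/s, ~23.4 MB/s
latency_ms = (1024 * 1024) / throughput * 1000     # time to read 1 MB
print(round(latency_ms, 1))                        # ~42.7 ms
```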
I am testing with a Lambda that sends a query to DynamoDB with X-Ray enabled. The X-Ray trace shows the query takes more than 300 ms, and I don't understand what may be causing the long latency.
What can I do if I want to reduce the latency to single-digit milliseconds? I don't want to split the partition, since 1 MB is not really a big size.
DynamoDB really is capable of single-digit millisecond latency, but only if the item size is small enough to fit into 1 RCU. Reading 1 MB of data from a database in <10 ms is a challenging task in itself.
Here is what you can try:
Split your read operation into two.
One query will use ScanIndexForward: true + Limit: N/2 and another will use ScanIndexForward: false + Limit: N/2. The idea is to query the same data from both ends towards the middle.
Do this in parallel and then merge the two responses into one (a rough sketch appears at the end of this answer).
However, this will likely only decrease latency from 300 ms to about 150 ms, which is still not <10 ms.
Use DAX - DynamoDB Caching Layer
If your 1 MB of data is spread across thousands of items, consider using fewer items, with each item holding more data.
Consider using a compression algorithm like brotli to compress the data you store in a single DynamoDB item. I once had success with this approach: depending on the format, it can easily reduce your data size by 4x, which translates into roughly 4x faster query time, and could be about 8x faster combined with the approach described in item #1.
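A minimal sketch of this compression idea, using the Python brotli package and boto3; the table and attribute names ("my-table", "pk", "payload") are hypothetical and not from the original post:

```python
# Sketch: compress an item's payload with brotli before writing it to DynamoDB.
# Table and attribute names here are hypothetical.
import json

import boto3
import brotli  # pip install brotli

table = boto3.resource("dynamodb").Table("my-table")

def put_compressed(pk_value, payload_dict):
    # Serialize and compress; bytes are stored as a DynamoDB Binary attribute.
    blob = brotli.compress(json.dumps(payload_dict).encode("utf-8"))
    table.put_item(Item={"pk": pk_value, "payload": blob})

def get_decompressed(pk_value):
    item = table.get_item(Key={"pk": pk_value})["Item"]
    raw = item["payload"].value  # boto3 returns a Binary wrapper; .value is bytes
    return json.loads(brotli.decompress(raw).decode("utf-8"))
```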
Also, beware that constantly reading 1 MB of data from a database will incur significant costs.
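And here is a rough sketch of the two-ended query from the first suggestion above, assuming boto3 and a hypothetical table with partition key "pk" and sort key "sk":

```python
# Sketch: query the same key from both ends in parallel and merge the halves.
# Table name and key names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")

def query_half(pk_value, limit, forward):
    # Read at most `limit` items from one end of the partition.
    resp = table.query(
        KeyConditionExpression=Key("pk").eq(pk_value),
        ScanIndexForward=forward,
        Limit=limit,
    )
    return resp["Items"]

def query_both_ends(pk_value, n):
    # Issue the ascending and descending halves in parallel, then merge,
    # de-duplicating on the sort key in case the two halves overlap.
    with ThreadPoolExecutor(max_workers=2) as pool:
        ascending = pool.submit(query_half, pk_value, n // 2, True)
        descending = pool.submit(query_half, pk_value, n // 2, False)
        merged = {item["sk"]: item for item in ascending.result() + descending.result()}
    return list(merged.values())
```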

S3 Select Result/Response size

AWS Documentation mentions: The maximum length of a record in the input or result is 1 MB. https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html
However, I'm able to fetch a 2.4 GB result when running an S3 Select query through a Python Lambda, and I have seen people working with even larger result sizes.
Can someone please clarify the significance of the 1 MB mentioned in the AWS documentation and what it means?
Background:
I recently faced the same question regarding the 1 MB limit. I'm dealing with a large gzip-compressed CSV file and had to figure out whether S3 Select would be an alternative to processing the file myself. My research makes me think the author of the previous answer misunderstood the question.
The 1 MB limit in the current AWS S3 Select documentation refers to the record size:
... The maximum length of a record in the input or result is 1 MB.
The SQL query is not the input (it has its own, smaller limit):
... The maximum length of a SQL expression is 256 KB.
Question Response:
I interpret this 1 MB limit the following way:
One row in the queried CSV file (the uncompressed input) can't use more than 1 MB of memory
One result record (a result row returned by S3 Select) also can't use more than 1 MB of memory
To put this in a practical perspective, consider the string size in bytes in Python; I'm using UTF-8 encoding.
This means len(row.encode('utf-8')) (the string size in bytes) <= 1024 * 1024 bytes for each CSV row of the input file, represented as a UTF-8 encoded string.
Likewise, len(response_json.encode('utf-8')) <= 1024 * 1024 bytes for each returned response record (in my case, the JSON result).
Note:
In my case, the 1 MB limit works fine. However, this depends a lot on the amount of data in your input (and potentially on extra, static columns you might add via SQL).
If the 1 MB limit is exceeded and you want to query files without involving a database solution, the more expensive AWS Athena might be an option.
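For completeness, here is a minimal sketch of issuing such a query from Python with boto3's select_object_content; the bucket, key, column name, and SQL expression are placeholders. The total result streamed back can be far larger than 1 MB, as long as no single input or result record exceeds it:

```python
# Sketch: S3 Select over a gzip-compressed CSV, streaming the result.
# Bucket, key, column name, and SQL are placeholders.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="data/large-file.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.* FROM S3Object s WHERE s.some_column = 'some_value'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        # Each chunk contains one or more newline-delimited JSON records,
        # each of which must individually be <= 1 MB.
        chunk = event["Records"]["Payload"].decode("utf-8")
        # ... process chunk ...
```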
Could you point us to the part of the documentation that talks about this 1 MB limit?
I have never seen a 1 MB limit. Downloading an object is just downloading, and you can download files of almost unlimited size.
AWS uploads files with multipart upload, which allows objects up to terabytes in size and object parts up to gigabytes.
The docs are here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
Response to the question
As per the author's comment below my post:
The limit is described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/querying-glacier-archives.html
That documentation refers to querying archived objects, so you can run a query on the data without retrieving it from Glacier.
The input query cannot exceed 1 MB, and the output of that query cannot exceed 1 MB.
The input is the SQL query.
The output is the file list.
Find more info here: https://docs.aws.amazon.com/amazonglacier/latest/dev/s3-glacier-select-sql-reference-select.html
So this limit is not for files but for SQL-like queries.

DynamoDB: When does the 1 MB limit for queries apply

In the docs for DynamoDB it says:
In a Query operation, DynamoDB retrieves the items in sorted order, and then processes the items using KeyConditionExpression and any FilterExpression that might be present.
And:
A single Query operation can retrieve a maximum of 1 MB of data. This limit applies before any FilterExpression is applied to the results.
Does this mean that KeyConditionExpression is applied before this 1 MB limit?
Indeed, your interpretation is correct. With KeyConditionExpression, DynamoDB can efficiently fetch only the data matching its criteria; you pay only for this matching data, and the 1 MB read limit applies to the matching data. With FilterExpression the story is different: DynamoDB has no efficient way of filtering out non-matching items before fetching them, so it reads everything the key condition selects and then filters out the items you don't want. You therefore pay for reading the entire unfiltered data (before FilterExpression is applied), and the 1 MB maximum also applies to the unfiltered data.
If you're still unconvinced that this is the way it should be, here's another issue to consider: imagine that you have 1 gigabyte of data in your database to be Scan'ed (or in a single key to be Query'ed), and after filtering, the result will be just 1 kilobyte. Were you to make this query and expect to get the 1 kilobyte back, DynamoDB would need to read and process the entire 1 gigabyte of data before returning. This could take a very long time, you would have no idea how long, and you would likely time out while waiting for the result. So instead, DynamoDB makes sure to return to you after every 1 MB of data it reads from disk (and for which you pay ;-)). Control will return to you 1000 (= 1 gigabyte / 1 MB) times during the long query, and you won't have a chance to time out. Whether a 1 MB limit actually makes sense here, or whether it should have been more, I don't know; maybe we should have had different limits for the response size and the read amount. But definitely some sort of limit was needed on the read amount, even if it doesn't translate to large responses.
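Here is a minimal sketch of what that paging looks like with boto3 (the table, key, and attribute names are hypothetical): each response covers at most 1 MB of read data, and you continue from LastEvaluatedKey until it disappears.

```python
# Sketch: paging through a Query whose underlying read is capped at 1 MB
# per request. Table, key, and attribute names are hypothetical.
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("my-table")

def query_all(pk_value):
    items, start_key = [], None
    while True:
        kwargs = {
            # The key condition narrows what is read (and paid for) ...
            "KeyConditionExpression": Key("pk").eq(pk_value),
            # ... while the filter only trims items from the already-read page.
            "FilterExpression": Attr("status").eq("active"),
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:  # no more pages: the whole key range has been read
            return items
```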
By the way, the Scan documentation includes a slightly differently-worded version of the explanation of the 1MB limit, maybe you will find it clearer than the version in the Query documentation:
A single Scan operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.

How does GemFireXD store table data larger than its in-memory capacity?

I have 4 GB of memory, and the data file I am going to load into GemFireXD is 8 GB. How does the in-memory store handle the remaining 4 GB of data? I read about the EVICTION clause but did not find any clarification.
While loading the data, is it copied to disk as it goes, or does it start copying to disk only after the 4 GB fills up?
Any help on this is appreciated.
Thank you.
If you use the EVICTION clause without the PERSISTENT clause, data will start being written to disk once you reach the eviction threshold. The least recently used rows are written to disk and dropped from memory.
If you have a PERSISTENT table, the data is already on disk by the time you reach your eviction threshold; at that point, the least recently used rows are simply dropped from memory.
Note that there is still a per-row overhead in memory even when a row is evicted.
Doc references for details:
- http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#overflow/configuring_data_eviction.html
- http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#caching_database/eviction_limitations.html

How do in-memory databases store data larger than RAM in GemFireXD?

Suppose I am using a cluster of 4 nodes, each with 4 GB of RAM, so the total RAM is 16 GB, and I have to store 20 GB of data in a table.
How will the in-memory database accommodate this data? I read somewhere that the data is swapped between RAM and disk, but wouldn't that make data access slow? Please explain.
GemFire or GemFireXD evicts data to disk if it comes under memory pressure while accommodating more data.
This may have some performance implications; however, the user can control how and when eviction takes place. All of the eviction options use a Least Recently Used (LRU) algorithm to decide which data to evict.
Also, when a row is evicted, the primary key value remains in memory while the remaining column data is evicted. This makes fetching the row from disk faster.
You can go through the following link to understand eviction in GemFireXD:
http://gemfirexd.docs.pivotal.io/1.3.0/userguide/developers_guide/topics/cache/cache.html
HANA offers the possibility to unload data from main memory. Since the data is then stored on the hard disk, queries accessing this data will of course run more slowly. Have a look at the hot/warm/cold data concept if you haven't heard about it.
This article gives you additional information about this topic: http://scn.sap.com/community/bw-hana/blog/2014/02/14/sap-bw-on-hana-data-classification-hotwarmcold
Though the question only targeted SQLite and HANA, I wanted to share some insights on Oracle Database In-Memory. It loads huge tables into the in-memory area by using various compression algorithms: data populated into the IM column store is compressed using a set of compression algorithms that not only save space but also improve query performance. For example, a table 10 GB in size, when compressed at the "capacity high" level, shrinks to about 3 GB. This allows a table whose size is greater than RAM to be stored in compressed form in the in-memory area.
The OP specifically asked about a cluster, so that rules out SQLite (at least out of the box). You need a DBMS that can:
treat the 4 x 4 GB of memory as 16 GB of "storage" (in other words, distribute the data across the nodes of the cluster but treat it as a whole)
compress the data to squeeze the 20 GB of raw data into the available 16 GB
eXtremeDB is one such solution. So is Oracle's Database In-Memory (with RAC). I'm sure there are others.
If you configure your tables accordingly, GemFireXD can use off-heap memory to store a larger amount of data in memory, pushing the need to evict data to disk a bit further out (and reads of evicted data are still optimized for faster lookup, because the lookup keys stay in memory).
http://gemfirexd.docs.pivotal.io/1.3.1/userguide/data_management/off-heap-guidelines.html