Relation between DISTSTYLE and Compression encoding in Redshift

Is there any relation between DISTSTYLE and compression encoding in Redshift? Whenever we use compression encoding, the operating system on the compute node does extra work encoding and decoding the data; with DISTSTYLE set to ALL, don't you think every node has to do that encoding and decoding work?
Any conceptual help here is highly appreciated.

The Distribution Style determines which node/slice will store the data. This has no relationship or impact on compression type. It is simply saying where to store the data.
Compression, however, is closely related to the Sort Key, which determines the order in which data is stored. Some compression methods use 'offsets' from previous values, or even store the number of repeated values, which can significantly compress data (e.g. "repeat this value 1000 times" rather than storing 1000 values).
Compression within Amazon Redshift has two benefits:
Less storage space (thus, less cost)
More data can be retrieved for each disk access
The slowest operation of any database is disk access. Therefore, any reduction in disk access will speed operations. The time taken to decompress data is minor compared to the time required for an additional disk read operation.
The second most 'expensive' operation is sending data between nodes. While network traffic is faster than disk access, it is best avoided.
When using DISTSTYLE ALL, it simply means that the data is available on every node, which avoids the need to transfer data across the network.
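To make that concrete, here is a minimal sketch (table, columns, encodings and connection details are all made up for illustration) showing that DISTSTYLE, the SORTKEY and per-column compression encodings are declared independently of each other in the DDL:

```python
import psycopg2  # any Redshift-compatible client works; connection details are placeholders

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="placeholder")
cur = conn.cursor()

# DISTSTYLE ALL only says "keep a full copy of this table on every node".
# The per-column ENCODE settings and the SORTKEY are independent choices.
cur.execute("""
    CREATE TABLE dim_country (
        country_id   INTEGER     ENCODE az64,
        country_name VARCHAR(64) ENCODE lz4,
        region       VARCHAR(32) ENCODE bytedict
    )
    DISTSTYLE ALL
    SORTKEY (country_id);
""")
conn.commit()
```

Each node decompresses only the blocks it reads for its part of a query, so the cost of compression does not multiply with DISTSTYLE ALL in the way the question suggests.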

Related

Best way to partition AWS Athena tables for querying S3 data with high cardinality

We have a bucket in S3 where we store thousands of records every day (we end up having many GBs of data that keep increasing) and we want to be able to run Athena queries on them.
The data in S3 is stored in patterns like this: s3://bucket/Category/Subcategory/file.
There are multiple categories (more than 100) and each category has 1-20 subcategories. All the files we store in S3 (in Apache Parquet format) include sensor readings. There are categories with millions of sensor readings (sensors send thousands per day) and categories with just a few hundred readings (sensors send on average a few readings per month), so the data is not split evenly across categories. A reading includes a timestamp, a sensor id and a value, among other things.
We want to run Athena queries on this bucket's objects, based on date and sensorid with the lowest cost possible. e.g.: Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.
What is the best way to partition our Athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries? We have the freedom to save one reading per file, resulting in millions of files (easy to partition per sensor id or date, but what about performance if we have millions of files per day?), or multiple readings per file (far fewer files, but we cannot directly partition per sensor id or date, because not all readings in a file are from the same sensor and we need to save them in the order they arrive). Is Athena a good solution for our case, or is there a better alternative?
Any insight would be helpful. Thank you in advance
Some comments.
Is Athena a good solution for our case or is there a better alternative?
Athena is great when you don't need or want to set up a more sophisticated big data pipeline: you simply put (or already have) your data in S3, and you can start querying it immediately. If that's enough for you, then Athena may be enough for you.
Here are a few things that are important to consider to properly answer that specific question:
How often are you querying? (i.e., is it worth having some sort of big data cluster running non-stop, like an EMR cluster? Or is it better to just pay when you query, even if it means that your per-query cost could end up higher?)
How much flexibility do you want when processing the dataset? (i.e., does Athena offer all the capabilities you need?)
What are all the data stores that you may want to query "together"? (i.e., is all the data in S3, and will it stay there? Or do you, or will you, have data in other services such as DynamoDB, Redshift, EMR, etc.?)
Note that none of these answers would necessarily say "don't use Athena" — they may just suggest what kind of path you may want to follow going forward. In any case, since your data is in S3 already, in a format suitable for Athena, and you want to start querying it already, Athena is a very good choice right now.
Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.
In both examples, you are filtering by category. This suggests that partitioning by category may be a good idea (whether you're using Athena or not!). You're doing that already, by having /Category/ as part of the objects' keys in S3.
One way to identify good candidates for partitioning schemes is to think about all the queries (at least the most common ones) that you're going to run, and check which equality filters or groupings they use. E.g., thinking in terms of SQL, a column that frequently shows up as WHERE XXX = ? is a natural partitioning candidate.
Maybe you have many more different types of queries, but I couldn't help but notice that both your examples had filters on category, thus it feels "natural" to partition by category (like you did).
Feel free to add a comment with other examples of common queries if that was just some coincidence and filtering by category is not as important/common as the examples suggest.
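For illustration, a rough sketch of what a category-partitioned Athena table could look like, submitted via boto3 (the table name, columns, bucket and result location are placeholders; since your current prefixes are Category/Subcategory/ rather than Hive-style category=..., each partition would also need to be registered explicitly):

```python
import boto3

athena = boto3.client("athena")

# Hypothetical table over the existing S3 layout. Because the prefixes are not
# in key=value form, partitions are added manually afterwards, e.g.:
#   ALTER TABLE sensor_readings ADD PARTITION (category='X') LOCATION 's3://bucket/X/'
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sensor_readings (
    sensor_id string,
    ts        timestamp,
    value     double
)
PARTITIONED BY (category string)
STORED AS PARQUET
LOCATION 's3://bucket/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)
```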
What is the best way to partition our athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries?
There's hardly a single (i.e., best) answer here. It's always a trade-off based on lots of characteristics of the data set (structure; size; number of records; growth; etc) and the access patterns (proportion of reads and writes; kinds of writes, e.g. append-only, updates, removals, etc; presence of common filters among a large proportion of queries; which queries you're willing to sacrifice in order to optimize others; etc).
Here's some general guidance (not only for Athena, but in general, in case you decide you may need something other than Athena).
There are two very important things to focus on to optimize a big data environment:
Minimize I/O, because I/O is slow.
Spread work evenly across all "processing units" you have, ideally fully utilizing each of them.
Here's why they matter.
First, for a lot of "real world access patterns", I/O is the bottleneck: reading from storage is many orders of magnitude slower than filtering a record in the CPU. So try to focus on reducing the amount of I/O. This means both reducing the volume of data read as well as reducing the number of individual I/O operations.
Second, if you end up with an uneven distribution of work across multiple workers, it may happen that some workers finish quickly while other workers take much longer, and their work cannot be divided further. This is a very common issue. In this case, you'll have to wait for the slowest worker to complete before you can get your results. When you ensure that all workers are doing an equivalent amount of work, they'll all be working at near 100% and they'll all finish at approximately the same time, so you're not left waiting for the slower ones.
Things to have in mind to help with those goals:
Avoid too big and too small files.
If you have a huge number of tiny files, then your analytics system will have to issue a huge number of I/O operations to retrieve data. This hurts performance (and, in case of S3, in which you pay per request, can dramatically increase cost).
If you have a small number of huge files, depending on the characteristics of the file format and the worker units, you may end up not being able to parallelize work too much, which can cause performance to suffer.
Try to keep the file sizes uniform, so that you don't end up with a worker unit finishing too quickly and then idling (may be an issue in some querying systems, but not in others).
Keeping files in the range of "a few GB per file" is usually a good choice.
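As a rough sketch of one way to do that compaction with pyarrow (file names are placeholders; in practice these would typically be objects in S3):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Periodically compact many small Parquet files into fewer, larger ones.
small_files = ["part-0001.parquet", "part-0002.parquet", "part-0003.parquet"]
combined = pa.concat_tables([pq.read_table(path) for path in small_files])
pq.write_table(combined, "compacted.parquet")
```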
Use compression (and prefer splittable compression algos).
Compressing files greatly improves performance because it reduces I/O tremendously: most "real world" datasets have a lot of common patterns, thus are highly compressible. When data is compressed, the analytics system spends less time reading from storage — and the "extra CPU time" spent to decompress the data before it can truly be queried is negligible compared to the time saved on reading from storage.
Keep in mind that there are some compression algorithms that are non-splittable: it means that one must start from the beginning of the compressed stream to access some bytes in the middle. When using a splittable compression algorithm, it's possible to start decompressing from multiple positions in the file. There are multiple benefits, including that (1) an analytics system may be able to skip large portions of the compressed file and only read what matters, and (2) multiple workers may be able to work on the same file simultaneously, as they can each access different parts of the file without having to go over the entire thing from the beginning.
Notably, gzip is non-splittable (but since you mention Parquet specifically, keep in mind that the Parquet format may use gzip internally, and may compress multiple parts independently and just combine them into one Parquet file, leading to a structure that is splittable; in other words: read the specifics about the format you're using and check if it's splittable).
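As a small illustration with pyarrow (file names are placeholders): the codec is chosen at write time, and because Parquet compresses column chunks/pages independently, the resulting file stays splittable regardless of codec:

```python
import pyarrow.parquet as pq

table = pq.read_table("compacted.parquet")

# Parquet compresses each column chunk/page independently, so the overall file
# remains splittable whether snappy, gzip or zstd is used for those chunks.
pq.write_table(table, "readings-snappy.parquet", compression="snappy")
pq.write_table(table, "readings-zstd.parquet", compression="zstd")
```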
Use columnar storage.
That is, storing data "per columns" rather than "per rows". This way, a single large I/O operation will retrieve a lot of data for the column you need rather than retrieving all the columns for a few records and then discarding the unnecessary columns (reading unnecessary data hurts performance tremendously).
Not only do you reduce the volume of data read from storage, you also improve how fast a CPU can process that data, since you'll have lots of pages of memory with useful data, and the CPU has a very simple set of operations to perform — this can dramatically improve performance at the CPU level.
Also, by keeping data organized by columns, you generally achieve better compression, leading to even less I/O.
You mention Parquet, so this is taken care of. If you ever want to change it, remember about using columnar storage.
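A tiny sketch of what that column pruning looks like with pyarrow (file and column names are placeholders; Athena does the equivalent automatically when reading Parquet):

```python
import pyarrow.parquet as pq

# A columnar reader fetches only the columns a query touches,
# instead of reading whole rows and discarding most of the data.
subset = pq.read_table("readings-snappy.parquet", columns=["sensor_id", "value"])
print(subset.num_rows, subset.column_names)
```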
Think about queries you need in order to decide about partitioning scheme.
Like the category filtering in the example above, which was present in both queries you gave as examples.
When you partition like in the example above, you greatly reduce I/O: the querying system will know exactly which files it needs to retrieve, and will avoid having to read the entire dataset.
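A minimal sketch of writing such a partitioned layout with pyarrow (paths and the partition column are placeholders; the category column must exist in the table):

```python
import pyarrow.parquet as pq

table = pq.read_table("readings-snappy.parquet")

# Hive-style layout (category=<value>/...) lets query engines prune whole
# directories when a query filters on category.
pq.write_to_dataset(table, root_path="partitioned/", partition_cols=["category"])
```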
There you go.
These are just some high-level guidance. For more specific guidance, it would be necessary to know more about your dataset, but this should at least get you started in asking yourself the right questions.

Decompress data on the fly from offsets in the original data?

I have a block of data I want to compress, say, C structures of variable sizes. I want to compress the data, but access specific fields of structures on the fly in application code without having to decompress the entire data.
Is there an algorithm which can take the offset (for the original data), decompress and return the data?
Compression methods generally achieve compression by making use of the preceding data. At any point in the compressed data, you need to know at least some amount of the preceding uncompressed data in order to decompress what follows.
You can deliberately forget the history at select points in the compressed data in order to have random access at those points. This reduces the compression by some amount, but that can be small with sufficiently distant random access points. A simple approach would be to compress pieces with gzip and concatenate the gzip streams, keeping a record of the offsets of each stream. For less overhead, you can use Z_FULL_FLUSH in zlib to do the same thing.
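A minimal sketch of the concatenated-gzip-members approach in Python (the records and the index layout are made up; the Z_FULL_FLUSH variant with zlib works the same way with less per-piece overhead):

```python
import gzip
import zlib

# Compress each piece as an independent gzip member and record where it starts.
pieces = [b"record one ...", b"record two ...", b"record three ..."]
blob = bytearray()
offsets = []
for piece in pieces:
    offsets.append(len(blob))
    blob += gzip.compress(piece)

# Random access: decompress only piece i, starting at its recorded offset.
# The decompressor stops at the end of that gzip member; the following members
# end up in d.unused_data and are simply ignored.
i = 1
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # 16+MAX_WBITS => gzip framing
print(d.decompress(bytes(blob[offsets[i]:])))      # b"record two ..."
```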
Alternatively, you can save the history at each random access point in a separate file. An example of building such a random access index to a zlib or gzip stream can be found in zran.c.
You can construct compression methods that do not depend on previous history for decompression, such as simple Huffman coding. However the compression ratio will be poor compared to methods that do depend on previous history.
An example is a compressed file system: there the filesystem API doesn't need to know about the compression that happens before the data is written to disk. There are a few algorithms out there that work this way.
However, maybe there is more gain in trying to optimize the data structures used, so there is no need to compress them at all?
For efficient access an index is needed. So between arrays, MultiMaps and sparse arrays, there should be a way to model the data so that no further compression is needed, because the data is already represented efficiently.
Of course this depends largely on the use case which is quite ambiguous.
A use case where a compression layer is needed to access the data is possible to imagine, but it's likely that there are better ways to solve the problem.

MemSQL Column Storage InMemory

Is it possible to store column-oriented tables in-memory in MemSQL? The standard is row-oriented tables in memory, column-oriented tables on disk.
MemSQL columnstore tables are always disk-backed, however columnstore data is of course cached in memory, so if all your data happens to fit in memory you will get in-memory performance. (The disk only needs to be involved in that writes must persist to disk for durability, and after restart data must be loaded from disk before it can be read, just like for any durable in-memory store.)
In the rowstore, we use data structures and algorithms (e.g. lock-free skip lists) that take advantage of the fact that the data is in memory to improve performance on point reads and writes, especially with high concurrency, but columnstore query execution works on fast scans over blocks of data and batch writes, which works well whether the data resides in memory or on disk.
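For reference, a rough sketch of declaring a columnstore table over the MySQL protocol (connection details, table name and columns are placeholders, and the exact DDL may vary between MemSQL versions):

```python
import pymysql  # MemSQL speaks the MySQL wire protocol

conn = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                       password="", database="demo")
with conn.cursor() as cur:
    # The clustered columnstore key marks the table as columnstore:
    # disk-backed, with hot data cached in memory as described above.
    cur.execute("""
        CREATE TABLE events (
            ts      DATETIME,
            payload JSON,
            KEY (ts) USING CLUSTERED COLUMNSTORE
        )
    """)
conn.commit()
```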

Any key-value storages with emphasis on compression?

Are there any key-value storages which fit the following criteria?
are open-source
persistent file storage
have replication and oplog
have configurable compression usable for storing 10-100 megabytes of raw text per second
work on windows and linux
Desired interface should contain at least:
store a record by a text or numeric ID
retrieve a record by ID
WiredTiger does support different kinds of compression:
Compression considerations
WiredTiger compresses data at several stages to preserve memory and disk space. Applications can configure these different compression algorithms to tailor their requirements between memory, disk and CPU consumption. Compression algorithms other than block compression work by modifying how the keys and values are represented, and hence reduce data size in-memory and on-disk. Block compression, on the other hand, compresses the data in its binary representation while saving it on the disk.
Configuring compression may change application throughput. For example, in applications using solid-state drives (where I/O is less expensive), turning off compression may increase application performance by reducing CPU costs; in applications where I/O costs are more expensive, turning on compression may increase application performance by reducing the overall number of I/O operations.
WiredTiger uses some internal algorithms to compress the amount of data stored that are not configurable, but always on. For example, run-length encoding reduces the size requirement by storing sequential, duplicate values in the store only a single time (with an associated count).
WiredTiger supports different kinds of compression:
key prefix
dictionary
huffman
and block compression, which supports, among other things, lz4, snappy, zlib and zstd.
Have a look at the documentation for full coverage of the subject.
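A rough sketch using the WiredTiger Python bindings (the database path and table name are placeholders, and the snappy block compressor must be available as a compiled-in extension):

```python
from wiredtiger import wiredtiger_open

# Open (or create) a database directory; "WT_HOME" is a placeholder path.
conn = wiredtiger_open("WT_HOME", "create")
session = conn.open_session()

# Compression is chosen per table at create time: block compression (snappy here)
# plus prefix compression for keys; dictionary/huffman settings are also per-table.
session.create("table:kv",
               "key_format=S,value_format=S,"
               "block_compressor=snappy,prefix_compression=true")

cursor = session.open_cursor("table:kv")
cursor.set_key("some-key")
cursor.set_value("some value")
cursor.insert()
cursor.close()
conn.close()
```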

Why does the AWS Dynamo SDK not provide a means to store objects larger than 64KB?

I had a use case where I wanted to store objects larger than 64kb in Dynamo DB. It looks like this is relatively easy to accomplish if you implement a kind of "paging" functionality, where you partition the objects into smaller chunks and store them as multiple values for the key.
This got me thinking however. Why did Amazon not implement this in their SDK? Is it somehow a bad idea to store objects bigger than 64kb? If so, what is the "correct" infrastructure to use?
In my opinion, it's an understandable trade-off DynamoDB made. To be highly available and redundant, they need to replicate data. To get super-low latency, they allowed inconsistent reads. I'm not sure of their internal implementation, but I would guess that the higher this 64KB cap is, the longer your inconsistent reads might be out of date with the actual current state of the item. And in a super low-latency system, milliseconds may matter.
This pushes the problem of an inconsistent Query returning chunk 1 and 2 (but not 3, yet) to the client-side.
As per question comments, if you want to store larger data, I recommend storing in S3 and referring to the S3 location from an attribute on an item in DynamoDB.
For the record, the maximum item size in DynamoDB is now 400K, rather than 64K as it was when the question was asked.
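A minimal sketch of that S3-pointer pattern with boto3 (bucket name, table name and attribute names are all placeholders):

```python
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name
BUCKET = "my-large-objects-bucket"                    # placeholder bucket name

def put_large_item(item_id: str, payload: bytes) -> None:
    # Store the blob in S3 and keep only a small pointer (plus metadata) in DynamoDB.
    key = f"large-items/{item_id}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    table.put_item(Item={"id": item_id, "payload_s3_key": key,
                         "payload_bytes": len(payload)})

def get_large_item(item_id: str) -> bytes:
    item = table.get_item(Key={"id": item_id})["Item"]
    return s3.get_object(Bucket=BUCKET, Key=item["payload_s3_key"])["Body"].read()
```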
From a design perspective, I think a lot of cases where you model your problem with >64KB chunks could also be translated into models where you split those chunks into <64KB chunks. And it is most often a better design choice to do so.
E.g. if you store a complex object, you could probably split it into a number of collections each of which encode one of the various facets of the object.
This way you probably get better, more predictable performance for large datasets as querying for an object of any size will involve a defined number of API calls with a low, predictable upper bound on latency.
Very often, Service Operations people struggle to get this predictability out of the system so as to guarantee a given latency at the 90/95/99th percentile of the traffic. AWS just chose to build this constraint into the API, as they probably already do for their own website and internal developments.
Also, of course, from an (AWS) implementation and tuning perspective, it is quite convenient to assume a 64KB cap, as it allows for predictable memory paging in/out, upper bounds on network round trips, etc.