I see the "BatchGetItem" API from AWS, which can be used to retrieve multiple items from DynamoDB. But the capacity consumed is n units for n items, even if they are only 10 bytes each. Is there a cheaper way of retrieving those items? I see the "Query" API, but it probably doesn't support ORing of keys, or even the IN operator.
The cheapest way to get a large number of small items out of DynamoDB is a Query.
BatchGetItem costs the same as the equivalent number of GetItem calls. Only the latency is lower; but if you have a large number of items to get in different partitions, and you're working in a language that handles parallelism well, you'll actually get better performance by issuing a large number of parallel GetItem calls.
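The parallel-GetItem approach can be sketched like this. The fetch callable is a stand-in for whatever performs a single-item lookup; the boto3 wiring in the comment is a hypothetical example, not tested code.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_get(keys, fetch, max_workers=32):
    """Fetch many items concurrently with individual GetItem-style calls.

    `fetch` is whatever performs a single lookup; pool.map preserves the
    order of `keys` in the returned list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, keys))

# Hypothetical boto3 wiring (table name and key shape are assumptions):
# table = boto3.resource("dynamodb").Table("my-table")
# items = parallel_get(keys, lambda k: table.get_item(Key=k).get("Item"))
```

The thread pool works well here because each call spends most of its time waiting on the network, not on the CPU.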
One way to re-organize your data would be to store all the small items under the same partition key, and use a Query to get all the items inside that partition in one go. You'll have to note the per-partition write throughput limits, but if you're adding new items infrequently and you're more worried about reads, this should work great.
In a Query, DynamoDB charges based on the total size of the data returned, irrespective of the number of items it contains. But this seems to be the only place where these economics apply.
Based on what I know of DynamoDB, this makes sense. In a BatchGetItem call, the items can easily be from different partitions (and even different tables), so there's not much efficiency to gain: each item must be individually looked up and fetched from its partition over DynamoDB's internal network. If all the keys in the batch happen to be from the same partition and sequential on the range key, that's a lucky coincidence, but DynamoDB can't exploit it.
In a Query on the other hand, DynamoDB has perfect knowledge that it's only going to be talking directly to one single partition, and doing a sequential read inside that partition, so that call is much simpler and cheaper with savings passed on to customers.
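A back-of-envelope comparison of the two pricing models, assuming strongly consistent reads (one read capacity unit covers up to 4 KB per read; eventually consistent reads halve both figures):

```python
import math

RCU_BLOCK = 4096  # one read capacity unit covers up to 4 KB, strongly consistent

def getitem_rcus(item_sizes):
    # GetItem / BatchGetItem: each item is rounded up to a 4 KB block individually.
    return sum(math.ceil(max(s, 1) / RCU_BLOCK) for s in item_sizes)

def query_rcus(item_sizes):
    # Query: the sizes of all items read are summed first, then rounded up once.
    return math.ceil(max(sum(item_sizes), 1) / RCU_BLOCK)

sizes = [10] * 100  # one hundred 10-byte items
# getitem_rcus(sizes) -> 100, query_rcus(sizes) -> 1
```

So for a hundred tiny items, the batch approach pays 100 units while a single-partition Query pays 1.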
I'm designing a database schema where there are only 2 partition keys, however, both partitions will be accessed equally - every time I access partition A, I will also access partition B. Will this design suffer from hot partitioning?
From my understanding, hot partitioning occurs because of uneven accesses to the partitions resulting in some partitions being "hot". So I'm thinking cardinality does not matter as long as the accesses are even. Is this correct?
Don't worry about unevenness.
Put very simply, each partition key maps to a certain amount of underlying capacity. The more partition keys you have, the easier it is for the system to add capacity as you need it. Does your database need a lot of capacity? If so, you may want to design with a different (more diverse) key strategy. If your needs are basic, don't overthink it and just get going.
We have a bucket in S3 where we store thousands of records every day (we end up having many GBs of data that keep increasing) and we want to be able to run Athena queries on them.
The data in S3 is stored in patterns like this: s3://bucket/Category/Subcategory/file.
There are multiple categories (more than 100), and each category has 1-20 subcategories. All the files we store in S3 (in Apache Parquet format) contain sensor readings. There are categories with millions of sensor readings (sensors that send thousands per day) and categories with just a few hundred readings (sensors that send, on average, a few readings per month), so the data is not split evenly across categories. A reading includes a timestamp, a sensorid, and a value, among other things.
We want to run Athena queries on this bucket's objects, based on date and sensorid with the lowest cost possible. e.g.: Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.
What is the best way to partition our Athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries? We have the freedom to save one reading per file, resulting in millions of files (easy to partition per sensorid or date, but what about performance if we have millions of files per day?), or multiple readings per file (far fewer files, but we can't directly partition per sensorid or date, because not all readings in a file are from the same sensor, and we need to save them in the order they arrive). Is Athena a good solution for our case, or is there a better alternative?
Any insight would be helpful. Thank you in advance
Some comments.
Is Athena a good solution for our case or is there a better alternative?
Athena is great when you don't need or want to set up a more sophisticated big data pipeline: you simply put (or already have) your data in S3, and you can start querying it immediately. If that's enough for you, then Athena may be enough for you.
Here are a few things that are important to consider in order to properly answer that specific question:
How often are you querying? (i.e., is it worth having some sort of big data cluster running non-stop, like an EMR cluster? Or is it better to just pay when you query, even if it means that your per-query cost could end up higher?)
How much flexibility do you want when processing the dataset? (i.e., does Athena offer all the capabilities you need?)
What are all the data stores that you may want to query "together"? (i.e., is all the data in S3, and will it stay that way? Or do you have, or will you have, data in other services such as DynamoDB, Redshift, EMR, etc.?)
Note that none of these answers would necessarily say "don't use Athena" — they may just suggest what kind of path you may want to follow going forward. In any case, since your data is in S3 already, in a format suitable for Athena, and you want to start querying it already, Athena is a very good choice right now.
Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.
In both examples, you are filtering by category. This suggests that partitioning by category may be a good idea (whether you're using Athena or not!). You're doing that already, by having /Category/ as part of the objects' keys in S3.
One way to identify good candidates for a partitioning scheme is to think about all the queries (at least the most common ones) that you're going to run, and check the equality filters or the groupings that they're doing. E.g., thinking in terms of SQL: if you often have queries with WHERE XXX = ?, then XXX is a good candidate column to partition by.
Maybe you have many more different types of queries, but I couldn't help noticing that both of your examples had filters on category, so it feels "natural" to partition by category (as you did).
Feel free to add a comment with other examples of common queries if that was just some coincidence and filtering by category is not as important/common as the examples suggest.
What is the best way to partition our athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries?
There's hardly a single (i.e., best) answer here. It's always a trade-off based on lots of characteristics of the data set (structure; size; number of records; growth; etc) and the access patterns (proportion of reads and writes; kinds of writes, e.g. append-only, updates, removals, etc; presence of common filters among a large proportion of queries; which queries you're willing to sacrifice in order to optimize others; etc).
Here's some general guidance (not only for Athena, but in general, in case you decide you may need something other than Athena).
There are two very important things to focus on to optimize a big data environment:
I/O is slow.
Spread work evenly across all "processing units" you have, ideally fully utilizing each of them.
Here's why they matter.
First, for a lot of "real world access patterns", I/O is the bottleneck: reading from storage is many orders of magnitude slower than filtering a record in the CPU. So try to focus on reducing the amount of I/O. This means both reducing the volume of data read as well as reducing the number of individual I/O operations.
Second, if you end up with an uneven distribution of work across multiple workers, it may happen that some workers finish quickly while other workers take much longer, and their work cannot be divided further. This is also a very common issue. In that case, you'll have to wait for the slowest worker to complete before you can get your results. When you ensure that all workers are doing an equivalent amount of work, they'll all be running at near 100% utilization and they'll all finish at approximately the same time, so you aren't left waiting for the slower ones.
Things to have in mind to help with those goals:
Avoid too big and too small files.
If you have a huge number of tiny files, then your analytics system will have to issue a huge number of I/O operations to retrieve the data. This hurts performance (and, in the case of S3, where you pay per request, it can dramatically increase cost).
If you have a small number of huge files, then depending on the characteristics of the file format and the worker units, you may end up not being able to parallelize the work much, which can cause performance to suffer.
Try to keep the file sizes uniform, so that you don't end up with a worker unit finishing too quickly and then idling (may be an issue in some querying systems, but not in others).
Keeping files in the range of "a few GB per file" is usually a good choice.
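As a sketch of what a compaction plan might look like, here is a simple greedy grouping of small files toward a target batch size. The 4 GB default is an arbitrary pick within the "a few GB" range, not a recommendation for any specific workload:

```python
def plan_compaction(file_sizes, target=4 * 1024**3):
    """Greedily group file sizes (in bytes) into batches of roughly
    `target` bytes each; each batch would then be rewritten as one file.
    Sorting largest-first keeps the batches reasonably uniform."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

A real compaction job would, of course, also read and rewrite the underlying objects; this only shows the grouping decision.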
Use compression (and prefer splittable compression algos).
Compressing files greatly improves performance because it reduces I/O tremendously: most "real world" datasets have a lot of common patterns, and thus are highly compressible. When data is compressed, the analytics system spends less time reading from storage, and the "extra CPU time" spent decompressing the data before it can actually be queried is negligible compared to the time saved reading from storage.
Keep in mind that some compression algorithms are non-splittable: that means one must start from the beginning of the compressed stream to access bytes in the middle. When using a splittable compression algorithm, it's possible to start decompressing from multiple positions in the file. There are multiple benefits, including that (1) an analytics system may be able to skip large portions of the compressed file and only read what matters, and (2) multiple workers may be able to work on the same file simultaneously, as each can access different parts of the file without having to go over the entire thing from the beginning.
Notably, gzip is non-splittable (but since you mention Parquet specifically, keep in mind that the Parquet format may use gzip internally, and may compress multiple parts independently and just combine them into one Parquet file, leading to a structure that is splittable; in other words: read the specifics about the format you're using and check if it's splittable).
Use columnar storage.
That is, store data "per column" rather than "per row". This way, a single large I/O operation will retrieve a lot of data for the column you need, rather than retrieving all the columns for a few records and then discarding the unnecessary ones (reading unnecessary data hurts performance tremendously).
Not only do you reduce the volume of data read from storage, you also improve how fast the CPU can process that data: you'll have lots of pages of memory filled with useful data, and the CPU has a very simple set of operations to perform. This can dramatically improve performance at the CPU level.
Also, by keeping data organized by columns, you generally achieve better compression, leading to even less I/O.
You mention Parquet, so this is taken care of. If you ever want to change formats, remember to use columnar storage.
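A toy illustration of the row-versus-column idea (this is not how Parquet is implemented internally, just the layout difference in miniature):

```python
def to_columnar(rows):
    """Transpose row-oriented records into per-column arrays, so a query
    that only needs one column never touches the others."""
    if not rows:
        return {}
    return {col: [row[col] for row in rows] for col in rows[0]}

readings = [
    {"timestamp": 1, "sensorid": "a", "value": 20.5},
    {"timestamp": 2, "sensorid": "b", "value": 21.0},
]
columns = to_columnar(readings)
# columns["value"] -> [20.5, 21.0]; the other columns can stay on disk
```

A query like "average value" now reads one contiguous array instead of scanning every field of every record.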
Think about queries you need in order to decide about partitioning scheme.
Like in the example above about the category filtering, that was present in both queries you gave as examples.
When you partition like in the example above, you greatly reduce I/O: the querying system will know exactly which files it needs to retrieve, and will avoid having to read the entire dataset.
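For instance, here is the layout you already have next to a hypothetical Hive-style alternative. The category=/dt= names are made up for illustration; Athena can use such key=value prefixes for partition pruning, which would also let you prune by date:

```python
def partition_key(bucket, category, subcategory, filename):
    """The question's existing layout: s3://bucket/Category/Subcategory/file."""
    return f"s3://{bucket}/{category}/{subcategory}/{filename}"

def hive_partition_key(bucket, category, dt, filename):
    """Hypothetical Hive-style layout; Athena recognizes key=value prefixes
    as partitions, so a filter on category and date skips whole prefixes."""
    return f"s3://{bucket}/category={category}/dt={dt}/{filename}"

# hive_partition_key("bucket", "CatA", "2020-01-01", "f.parquet")
#   -> "s3://bucket/category=CatA/dt=2020-01-01/f.parquet"
```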
There you go.
These are just some high-level guidance. For more specific guidance, it would be necessary to know more about your dataset, but this should at least get you started in asking yourself the right questions.
I want to put a large number of items into DynamoDB (probably around 100k per day, though this could scale upwards in the future).
A small percentage of these will get a lot more hits than the others (not sure on the exact figure, say 2%-5%). I won't be able to determine which in advance.
The hash key for each is simply a unique positive integer (item_id). And I need the range key to be a Unix timestamp.
The problem is, will I run into a hot-key situation with this setup? I'm not sure if partitions are created for every single hash key value, or whether hash keys are randomly assigned to different partitions.
If it's the latter, I should be safe, because the items with more hits will be randomly distributed across the partitions. But if it's the former, then some partitions will get a lot more hits than others.
Don't be discouraged; no DynamoDB table has the perfectly distributed access pattern the documentation suggests. You'll have some hot spots; that's normal and OK. You may have to increase your read/write throughput to accommodate the hot spots, and depending on how hot they are, that might make a difference in cost. But at the modest throughput levels you describe, it isn't going to make DynamoDB unusable or anything.
I recommend converting your capacity requirements into the per-second throughput metrics DynamoDB uses. Will the 100,000 per day really be evenly distributed, at roughly 1 per second?
How many are reads vs. writes?
How big are they in 1K capacity chunks?
Is there a big difference between peak and trough usage?
Can caching be employed to smooth out the read pattern?
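The daily-to-per-second arithmetic is simple enough to sketch. peak_factor here is a hypothetical multiplier for bursty traffic, not something DynamoDB defines:

```python
def per_second(daily_ops, peak_factor=1.0):
    """Convert a daily operation count into a per-second rate.
    peak_factor > 1 models traffic that bunches up rather than
    arriving evenly over all 86,400 seconds of the day."""
    return daily_ops / 86400 * peak_factor

avg = per_second(100_000)       # about 1.16 operations/second on average
peak = per_second(100_000, 10)  # about 11.6/s if traffic bunches up 10x
```

Provisioning for the average while traffic actually arrives in bursts is a classic way to get throttled, which is why the peak/trough question above matters.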
Yes, the hash keys will be distributed across partitions. Partitions do not correspond to individual items, but to allocations of read/write capacity and storage (Understanding Partition Behavior).
I had a use case where I wanted to store objects larger than 64 KB in DynamoDB. It looks like this is relatively easy to accomplish if you implement a kind of "paging" functionality, where you partition the objects into smaller chunks and store them as multiple values for the key.
This got me thinking however. Why did Amazon not implement this in their SDK? Is it somehow a bad idea to store objects bigger than 64kb? If so, what is the "correct" infrastructure to use?
In my opinion, it's an understandable trade-off DynamoDB made. To be highly available and redundant, they need to replicate data. To get super-low latency, they allowed inconsistent reads. I'm not sure of their internal implementation, but I would guess that the higher this 64 KB cap, the further your inconsistent reads might lag behind the actual current state of the item. And in a super-low-latency system, milliseconds may matter.
This pushes the problem of an inconsistent Query returning chunk 1 and 2 (but not 3, yet) to the client-side.
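A minimal sketch of such client-side chunking. The chunk_size and attribute names are arbitrary choices for illustration; each chunk records the total count so a reader can notice an incomplete (inconsistent) read of the kind described above:

```python
def chunk_item(blob, chunk_size=350_000):
    """Split a large payload into pieces that fit under DynamoDB's item
    size limit (400 KB today; chunk_size leaves headroom for key and
    attribute overhead). Each chunk carries its index and the total
    count, so a reader that sees chunks 1 and 2 but not 3 can tell the
    read was incomplete and retry."""
    total = (len(blob) + chunk_size - 1) // chunk_size
    return [
        {"chunk": i, "of": total, "data": blob[i * chunk_size:(i + 1) * chunk_size]}
        for i in range(total)
    ]
```

Each dict would become one DynamoDB item, e.g. with the object id as partition key and the chunk index as range key, so a single Query retrieves all the pieces.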
As per question comments, if you want to store larger data, I recommend storing in S3 and referring to the S3 location from an attribute on an item in DynamoDB.
For the record, the maximum item size in DynamoDB is now 400K, rather than 64K as it was when the question was asked.
From a design perspective, I think a lot of cases where you can model your problem with >64 KB chunks can also be translated into models where you split those chunks into <64 KB chunks. And it is most often a better design choice to do so.
E.g. if you store a complex object, you could probably split it into a number of collections each of which encode one of the various facets of the object.
This way you probably get better, more predictable performance for large datasets as querying for an object of any size will involve a defined number of API calls with a low, predictable upper bound on latency.
Very often, service operations people struggle to get this predictability out of a system so as to guarantee a given latency at the 90/95/99th percentile of traffic. AWS just chose to build this constraint into the API, as they probably already do for their own website and internal developments.
Also, from an (AWS) implementation and tuning perspective, it is of course quite comfortable to assume a 64 KB cap, as it allows for predictable memory paging in/out, upper bounds on network round trips, etc.
We have an application which is written entirely in C. For table access inside the code, such as fetching some values from a table, we use Pro*C. And to increase the performance of the application, we also preload some tables for fetching the data. In general, we take some input fields and fetch the output fields from the table.
We usually have around 30,000 entries in the table, and at most it reaches about 100,000 at times.
But if the table entries increase to around 10 million, I think it could seriously affect the performance of the application.
Am I wrong somewhere? If it really affects the performance, is there any way to keep the performance of the application stable?
What is the possible workaround if the number of rows in the table increases to 10 million considering the way the application works with tables?
If you are not sorting the table, you'll get a proportional increase in search time: assuming you don't code anything wrong, in your example (30K vs. 10M) you'll get roughly 333x greater search times. I'm assuming you're iterating through the table incrementally (i++ style).
However, if it's somehow possible to sort the table, then you can greatly reduce search times. That's because a search over sorted data doesn't need to scan every element until it reaches the sought one: it can use auxiliary structures (trees, hashes, etc.), which are usually much faster to search, to pinpoint the sought element directly, or at least get a much closer estimate of where it is in the master table.
Of course, that comes at the expense of having to sort the table, either when you insert or remove elements, or when you perform a search.
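The linear-versus-sorted difference can be sketched with Python's bisect module, as a stand-in for whatever index structure you'd build in C:

```python
import bisect

def linear_find(table, key):
    # O(n): touches every element until it finds the key.
    for i, k in enumerate(table):
        if k == key:
            return i
    return -1

def sorted_find(table, key):
    # O(log n): on a sorted table, binary search halves the range each step,
    # so 10M entries take ~24 comparisons instead of up to 10M.
    i = bisect.bisect_left(table, key)
    return i if i < len(table) and table[i] == key else -1

table = sorted([7, 3, 9, 1, 5])  # [1, 3, 5, 7, 9]
# sorted_find(table, 7) -> 3
```

In C, the equivalent would be qsort plus bsearch over the preloaded array, or a hash table keyed on the input fields.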
Maybe you can search for "google hash" and take a look at their implementation? Although it is in C++.
It might be that you're hitting too many cache misses once the table grows beyond 1 MB, or whatever your cache size is.
If you iterate over the table multiple times, or access elements randomly, you can also hit a lot of cache misses.
http://en.wikipedia.org/wiki/CPU_cache#Cache_Misses
Well, it really depends on what you are doing with the data. If you have to load the whole kit and caboodle into memory, then a reasonable approach would be to use a large bulk size, so that the number of Oracle round trips that need to occur is small.
If you don't really have the memory resources to allow the whole result set to be loaded into memory, then a large bulk size will still help with the Oracle overhead. Get a reasonable size chunk of records into memory, process them, then get the next chunk.
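The chunked-processing pattern looks roughly like this, sketched in Python. cursor_rows stands in for an iterator over your actual Pro*C/Oracle bulk-fetch loop, and the bulk size is an arbitrary example value:

```python
def in_chunks(cursor_rows, bulk_size=10_000):
    """Yield successive batches of `bulk_size` rows, so only one batch is
    held in memory at a time while still amortizing per-round-trip cost."""
    chunk = []
    for row in cursor_rows:
        chunk.append(row)
        if len(chunk) == bulk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly partial, batch

# for batch in in_chunks(rows, 10_000): process(batch)
```

In Pro*C itself, the analogue is an array fetch (FETCH ... with host arrays) sized to the same trade-off: big enough to cut round trips, small enough to fit comfortably in memory.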
Without more information about your actual run time environment, and business goals, that is about as specific as anyone can get.
Can you tell us more about the issue?