Here is my basic data structure (or the relevant portions anyway) in DynamoDB; I have a files table that holds file data and has an id for the file. I also have a 'Definitions' table that holds items defined in the file. Definitions also have an ID (as the primary key) as well as a field called 'SourceFile' that references the file id in order to tie the definition to it's source file.
Most of the time I want to just get the definition by it's id and optionally get the file later which works just fine. However, in some cases I need to get all definitions for a set of files. I can do this with a scan but it's slow and I know that it will get slower as the table grows and isn't recommended. However I'm not sure how to do this with a query.
I can create a GSI that uses the SourceFile field as the primary key and use that to query against. This sounds like an answer (and may be), however I'm not sure. The problem is that some libraries may have 5k or 10k files (maybe more in rare cases). In a GSI I can only query against 1 file ID per query so I would have to throw a new query for each file and I can't imagine it's going to be very efficient to throw 10K queries at DynamoDB...
Is it better to create a tight loop (or multiple threads) and hit it with a ton of queries or to scan the table? Is there another way to do this that I'm not thinking of?
This is during an indexing and analysis process that is expected to take a bit of time so it's ok that it's not instant but I'd like it to be as efficient as possible...
Scans are the most efficient if you expect to be looking for a majority of data in your database. You can retrieve up to 1MB per scan request, and for each unit of capacity available you can read 4KB, so assuming you have enough capacity provisioned, you can retrieve thousands of items in a single request (assuming the items are pretty small).
The only alternative I can think of is to add more metadata that can help you index the files & definitions at a higher level - like, for instance, the library name/id. With that you can create a GSI on library name/id and query that way.
Running thousands of queries is going to less efficient than scanning assuming you are storing on the order of tens/hundreds of thousands of items.
Related
I am storing 400,000 parquet files in S3 that are partitioned based on a unique id (e.g. 412812). The files range in size from 25kb to 250kb of data. I then want to query the data using Athena. Like so,
Select *
From Table
where id in (412812, 412813, 412814)
This query is much slower than anticipated. I want to be able to search for any set of ids and get a fast response. I believe it is slow is because Athena must search through the entire glue catalog looking for the right file (i.e., a full scan of files).
The following query is extremely fast. Less than a second.
Select *
From Table
where id = 412812
partition.filtering is enabled on the table. I tried adding an index to the table that was the same as the partition, but it did not speed anything up.
Is there something wrong with my approach or a table configuration that would make this process faster?
Your basic problem is that you have too many files and too many partitions.
While Amazon Athena does operate in parallel, there are limits to how many files it can process simultaneously. Plus, each extra file adds overhead for listing, opening, etc.
Also, putting just a single file in each partition greatly adds to the overhead of handling so many partitions and is probably counterproductive for increasing the efficiency of the system.
I have no idea of how you actually use your data, but based on your description I would recommend that you create a new table that is bucketed_by the id, rather than partitioned:
CREATE TABLE new_table
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
external_location = 's3://bucket/new_location/',
bucketed_by = ARRAY['id']
)
AS SELECT * FROM existing_table
Let Athena create as many files as it likes -- it will optimize based upon the amount of data. More importantly, it will create larger files that allow it to operate more efficiently.
See: Bucketing vs Partitioning - Amazon Athena
In general, partitions are great when you can divide the data into some major subsets (eg by country, state or something that represents a sizeable chunk of your data), while bucketing is better for fields that have values that are relatively uncommon (eg user IDs). Bucketing will create multiple files and Athena will be smart enough to know which files contain the IDs you want. However, it will not be partitioned into subdirectories based upon those values.
Creating this new table will greatly reduce the number of files that Amazon Athena would need to process for each query, which will make your queries run a LOT faster.
I've included some links along with our approaches to other answers, which seem to be the most optimal on the web right now.
Our records need to be categorized (eg. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (less than 100).
We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.
We have tried two different approaches, with two different data structures.
Approach 1
AWS DynamoDB - Pick a record/item randomly?
Help selecting nth record in query.
In short, using the category as a hash key, and a UUID as the sort key. Generate a random UUID, query Dynamo using greater than or less than, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around).
Issues with this approach:
First query can fail if it is greater than/less than any of the UUIDs
Querying on any specific category will cause throttling at scale (Small number of partitions)
We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.
AWS Database Blog
Choosing the Right DynamoDB Partition Key
Approach 2
Amazon Web Services: How do we get random item from the dynamoDb's table?
Doing something similar to this, where we concatenate the category with a sequential number, and use this as the hash key. e.g. horror-000001.
By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
Issues with this approach
We need a secondary data structure to manage the sequential counts across each category
Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.
Conclusion
Both approaches solve our main use case of random queries on category/categories, but the cons they offer are really deterring us from using them. We're leaning more towards approach #1 using suffixes to solve the hot partitioning issue, although we would need the additional retry logic for failed queries.
Is there a better way of approaching this problem? Specifically looking for solutions capable of scaling well (No scan), without requiring extra resources be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is being called inside a lambda (billed for time used).
Thanks!
Follow Up
After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases were DynamoDB does not fit, and requires rewriting your use case to fit the DB (Bad).
We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.
For anyone wondering why we chose MySQL, it was largely due to the Nodejs library available, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazons Aurora database.
We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!
This was too long for a comment, and I guess it's pretty much a full fledged answer now.
Approach 2
I've found that my typical time to get a single item from dynamodb to a host in the same region is <10ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.
If you use a keys only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the largest numbered single item within a category.
When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.
When you delete, find the item with the largest sequence number for that category from the GSI, overwrite the item you are deleting, and then delete the now duplicated item from its position at the highest sequence number.
To randomly get an item, query the GSI to find the highest numbered item in the category, and then randomly pick a number since you now know the valid range.
Approach 1
I'm not sure exactly what you mean when you say "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make Approach 1 work by putting a DAX cluster in front of your dynamodb table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)
I have next table structure:
ID string `dynamodbav:"id,omitempty"`
Type string `dynamodbav:"type,omitempty"`
Value string `dynamodbav:"value,omitempty"`
Token string `dynamodbav:"token,omitempty"`
Status int `dynamodbav:"status,omitempty"`
ActionID string `dynamodbav:"action_id,omitempty"`
CreatedAt time.Time `dynamodbav:"created_at,omitempty"`
UpdatedAt time.Time `dynamodbav:"updated_at,omitempty"`
ValidationToken string `dynamodbav:"validation_token,omitempty"`
and I have 2 Global Secondary Indexes for Value(ValueIndex) filed and Token(TokenIndex) field. Later somewhere in the internal logic I perform the Update of this entity and immediate read of this entity by one of this indexes(ValueIndex or TokenIndex) and I see the expected problem that data is not ready(I mean not yet updated). I can't use ConsistentRead for this cases, because this is Global Secondary Index and it doesn't support this options. As a result I can't run my load tests over this logic, because data is not ready when tests go in 10-20-30 threads. So my question - is it possible to solve this problem somewhere? or should I reorganize my table and split it to 2-3 different tables and move filed like Value, Token to HASH key or SORT key?
GSIs are updated asynchronously from the table they are indexing. The updates to a GSI typically occur in well under a second. So, if you're after immediate read of a GSI after insert / update / delete, then there is the potential to get stale data. This is how GSIs work - nothing you can do about that. However, you need to be really mindful of three things:
Make sure you keep your GSI lean - that is, only project the absolute minimum attributes that you need. Less data to write will make it quicker.
Ensure that your GSIs have the correct provisioned throughput. If it doesn't, it may not be able to keep up with activity in the table and therefore you'll get long delays in the GSI being kept in sync.
If an update causes the keys in the GSI to be updated, you'll need 2 units of throughput provisioned per update. In essence, DynamoDB will delete the item then insert a new item with the keys updated. So, even though your table has 100 provisioned writes, if every single write causes an update to your GSI key, you'll need to provision 200 write units.
Once you've tuned your DynamoDB setup and you still absolutely cannot handle the brief delay in GSIs, you'll probably need to use different technology. For example, even if you decided to split your table into multiple tables, it'll have the same (if not worse) impact. You'll update one table, then try to read the data from another table and you haven't yet inserted the values into a different table.
I suspect that once you tune DynamoDB for your situation, you'll get pretty damn close you what you want.
Suppose I have a dynamo table named Person, which has 2 fields, name (string), age (int). Let's assume it has a TB worth of data and experiences a small amount of read throughput, but a ton of write throughput. Now I want to add a new field called Phone (string). What is the best way to go about moving the data from one table to another?
Note: Dynamo doesn't let you rename tables, and fields cannot be null.
Here are the options I think I have:
Dump the table to .csv, run a script (overnight probably since it's a TB worth of data) to add a default phone number to this file. (Not ideal, will also lose all new data submitted into old table, unless I bring the service offline to perform the migration (which is not an option in this case)).
Use the SCAN api call. (SCAN will read all values, then will consume significant write throughput on the new table to insert all old data into it).
How can I do perform a dynamo migration on a large table w/o
significant data loss?
you don't need to do anything. This is NoSQL, not SQL. (i.e. there is no idiomatic way to do this as you normally don't need migrations for NoSQL)
Just start writing entries with the additional key.
Records you get back that are written before will not have this key. What you normally do is have a default value you use when missing.
If you want to backfill, just go through and read the value + put the value with the additional field. You can do this in one run via a scan or again do it lazily when accessing the data.
I'm rewriting an application which handles a lot of data (about 100 GB) which is designed as a relational model.
The application is very complex; it is some kind of conversion tool for open street map data of huge sizes (the whole world) and converts it into a map file for our own route planning software. The converter application for example holds the nodes in the open street map with their coordinate and all its tags (a lot of more than that, but this should serve as an example in this question).
Current situation:
Because this data is very huge, I split it into several files: Each file is a map from an ID to an atomic value (let's assume that the list of tags for a node is an atomic value; it is not but the data storage can treat it as such). So for nodes, I have a file holding the node's coords, one holding the node's name and one holding the node's tags, where the nodes are identified by (non-continuous) IDs.
The application once was split into several applications. Each application processes one step of the conversion. Therefore, such an application only needs to handle some of the data stored in the files. For example, not all applications need the node's tags, but a lot of them need the node's coords. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure within RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
I'm currently rewriting the converter. It should now be one single application. And it should now not use separated files for each "column". It should rather use some well-known architecture to hold external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast in iterating over the existing data (while not modifying the set of rows, but some values in the current row).
It needs to provide constant or near-constant lookup, similar to hash maps (while not modifying the whole relation at all).
Most of the types of the columns are constantly sized, but in general they are not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating some kind of search index will not be required. Updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value-based, while others are an (continuously indexed) array. Both of them should provide fast lookups.
It should NOT be a separate process, like a DBMS like MySQL would be. The number of queries will be enormous (around 10 billions) and will be totally the bottle neck of the performance. However, caching queries would be a possible workaround: Iterating over a whole table can be done in a single query while writing to a table (from which no data will be read in the same processing step) can happen in a batch query. But still: I guess that serializing, inter-process-transmitting and de-serializing SQL queries will be the bottle neck.
Nice-to-have: easy to use. It would be very nice if the relations can be used in a similar way than the C++ standard and Qt container classes.
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and "output relations" it writes into. However, some steps require to read some columns of a relation while writing in other columns of the same relation.
Joining relations. There are a few cross-references between different relations, however, they can be resolved within my application if lookup is fast enough.
Persistent storage. Once the conversion is done, all the data will not be required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server, the 'server' is linked into your application.
Personally this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.