Can you add a global secondary index to DynamoDB after the table has been created?

With an existing DynamoDB table, is it possible to modify the table to add a global secondary index? From the DynamoDB console, it looks like I have to delete the table and create a new one with the global index.

Edit (January 2015):
Yes, you can add a global secondary index to a DynamoDB table after its creation; see here, under "Global Secondary Indexes on the Fly".
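As a minimal sketch, here is what adding a GSI to an existing table looks like with boto3's UpdateTable call (the table, index, and attribute names here are hypothetical):

```python
import boto3

client = boto3.client("dynamodb")

# Add a GSI named "email-index" to an existing table (hypothetical names).
client.update_table(
    TableName="Users",
    AttributeDefinitions=[
        {"AttributeName": "email", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "email-index",
                "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)
```

The index is built in the background; the table keeps serving traffic while the backfill runs.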
Old Answer (no longer strictly correct):
No, the hash key, range key, and indexes of the table cannot be modified after the table has been created. You can easily add attributes that are not hash keys, range keys, or indexed attributes after table creation, though.
From the UpdateTable API docs:
You cannot add, modify or delete indexes using UpdateTable. Indexes can only be defined at table creation time.
To the extent possible, you should really try to anticipate current and future query requirements and design the table and indexes accordingly.
You could always migrate the data to a new table if need be.

Just got an email from Amazon:
Dear Amazon DynamoDB Customer,
Global Secondary Indexes (GSI) enable you to perform more efficient queries. Now, you can add or delete GSIs from your table at any time, instead of just during table creation. GSIs can be added via the DynamoDB console or a simple API call. While the GSI is being added or deleted, the DynamoDB table can still handle live traffic and provide continuous service at the provisioned throughput level. To learn more about Online Indexing, please read our blog or visit the documentation page for more technical and operational details.
If you have any questions or feedback about Online Indexing, please email us.
Sincerely, The Amazon DynamoDB Team

According to the latest news from AWS, GSI support for existing tables will be added soon.
Official statement on AWS forum

Related

Do DynamoDB secondary indexes contain actual table rows?

In the SQL world, when you create a non-clustered index, it creates a separate data structure that allows you to find pointers to table rows based on a key that is not the primary key of the table.
From the DynamoDB docs it seems as though creating a secondary index creates a separate data structure that holds a copy of the actual table rows, not just a pointer to those rows.
Is that right?
That's partially correct - for a global secondary index, it will definitely create a second table and update that asynchronously based on the changes in the primary table. That's why you can only do eventually consistent reads on this index.
For local secondary indexes it's most likely the same table.
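To make the consistency point concrete, a small sketch with hypothetical table and index names: a Query against a GSI is always eventually consistent, and requesting a strongly consistent read on one is rejected, while an LSI accepts the same flag.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")  # hypothetical table

# A query against a GSI is always eventually consistent:
table.query(
    IndexName="my-gsi",  # hypothetical GSI name
    KeyConditionExpression=Key("gsiPk").eq("some-value"),
)

# Requesting a strongly consistent read on a GSI fails with a
# ValidationException; the same flag is accepted on an LSI:
table.query(
    IndexName="my-gsi",
    KeyConditionExpression=Key("gsiPk").eq("some-value"),
    ConsistentRead=True,  # rejected for GSIs
)
```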
There is a talk from re:Invent 2018 where they explain the underlying data structures, which I can highly recommend:
AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321)

Best practice for using a Dynamo table that needs to be periodically updated

In my use case, I need to periodically update a DynamoDB table (say, once per day). Considering that lots of entries need to be inserted, deleted, or modified, I plan to drop the old table and create a new one each time.
How could I keep the table queryable while I recreate it? Which API should I use? It's fine for queries to keep hitting the old table during the rebuild, so that customers won't experience any outage.
Is it possible to have something like a version number for the table, so that I could roll back quickly?
I would suggest versioned table names: a common prefix with a varying suffix (some people use a date, others a version number).
Store the usable DynamoDB table name in a configuration store (if you are not already using one, you could use Secrets Manager, SSM Parameter Store, another DynamoDB table, a Redis cluster or a third party solution such as Consul).
Automate the creation and insertion of data into a new DynamoDB table. Then update the config store with the name of the newly created DynamoDB table. Allow enough time to switchover, then remove the previous DynamoDB table.
You could do the final part by using Step Functions to automate the workflow, with a Wait of a few hours to ensure that nothing is still happening; in fact, you could even add a Lambda function that validates whether any traffic is still hitting the old DynamoDB table.
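A minimal sketch of the config-store switchover, assuming SSM Parameter Store and hypothetical names:

```python
import boto3

ssm = boto3.client("ssm")
dynamodb = boto3.resource("dynamodb")

PARAM_NAME = "/myapp/active-table-name"  # hypothetical parameter

def active_table():
    """Resolve the currently live table from the config store."""
    name = ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]
    return dynamodb.Table(name)

def switch_over(new_table_name: str) -> None:
    """Point readers at the freshly loaded table."""
    ssm.put_parameter(
        Name=PARAM_NAME,
        Value=new_table_name,
        Type="String",
        Overwrite=True,
    )
```

Rollback is then just another put_parameter pointing back at the previous table, which is why keeping the old table around for a grace period matters.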

DynamoDB - how to query by something that is not the primary key

So, I have a table on DynamoDB with this structure:
- userId as the primary key (it's a UUID)
- email
- hashedPassword
I want to, as someone is signing up, find out if there's already someone using that email.
This should be easy but, as far as I know, you can't query DynamoDB unless you use the partition key (and optionally the sort key) as parameters (and I'm not sure it would make sense to make email a sort key).
The other way I found out was using a Global Secondary Index, which is pretty much an index table you create using another field as the primary key, sort of, but this is billable, and since I'm still developing and testing I didn't want to incur expenses.
Does anyone have another option? Or am I wrong and there's another way to do it?
Like the other answers, I also think that a GSI is the best option here.
But I would like to also add that since search capabilities of DynamoDB are very limited, it is not uncommon to use DynamoDB with something else for that very purpose. One such use case is described in the AWS blog:
Indexing Amazon DynamoDB Content with Amazon Elasticsearch Service Using AWS Lambda
The main querying capabilities of DynamoDB are centered around lookups using a primary key. However, there are certain times where richer querying capabilities are required. Indexing the content of your DynamoDB tables with a search engine such as Elasticsearch would allow for full-text search.
Obviously, I don't recommend using ES over GSI in your scenario. But it is worth knowing that DynamoDB can be, and is often, used with other services to extend its search capabilities.
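For illustration only, a rough sketch of that stream-to-search pattern, assuming a Lambda function subscribed to the table's DynamoDB Stream (with the NEW_IMAGE view) and a hypothetical search endpoint; request signing and error handling are omitted:

```python
import requests  # assumes the search endpoint accepts plain HTTP calls

SEARCH_URL = "https://search.example.com/users"  # hypothetical index URL

def handler(event, context):
    """Lambda handler attached to the table's DynamoDB Stream."""
    for record in event["Records"]:
        key = record["dynamodb"]["Keys"]["userId"]["S"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is in DynamoDB's attribute-value JSON format;
            # a real indexer would flatten it before indexing.
            doc = record["dynamodb"]["NewImage"]
            requests.put(f"{SEARCH_URL}/_doc/{key}", json=doc)
        elif record["eventName"] == "REMOVE":
            requests.delete(f"{SEARCH_URL}/_doc/{key}")
```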
Even if you put email as a sort key alongside userId as the partition key, you can't query using only the email (short of a scan operation). You don't want to use a scan to check whether an email exists in your table; that's like checking each value by iterating over the whole table.
I think your best option is a global secondary index. Another option would be creating a new table that only holds email values, but in that case you have to write to and maintain multiple tables, which is unnecessary.
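For example, with a GSI on email (hypothetically named "email-index"), the sign-up check becomes a cheap query instead of a scan:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Users")  # hypothetical table

def email_taken(email: str) -> bool:
    resp = table.query(
        IndexName="email-index",  # hypothetical GSI on the email attribute
        KeyConditionExpression=Key("email").eq(email),
        Limit=1,  # we only care whether at least one item exists
    )
    return resp["Count"] > 0
```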
As @Ersoy has said, a GSI is the legit solution, even though it will increase your consumed write units.
DynamoDB is cheap for a low-traffic app and/or a test environment, but to keep these expenses flat, you can:
Use DynamoDB Local during local development, tests, and CI builds (see the sketch below)
Choose provisioned capacity mode for your table (you may find its free tier interesting)
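As a small sketch of the first point, boto3 can be pointed at a locally running DynamoDB Local instance (default port 8000; the credentials are dummies):

```python
import boto3

# Point the SDK at DynamoDB Local instead of the AWS endpoint.
dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="us-east-1",
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
)
```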

How do I avoid hot partitions when using DynamoDB row-level access control?

I'm looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I only have one provider ID, but I know I will have more. However, the providers will vary in size, with those sizes being very unbalanced.
If I use Provider ID as my partition key, it seems to me like my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding the row-level access control I was using DeviceId as the partition key, since it is much more random and therefore partitions well, but now I think I have to move that to the sort key.
Current partitioning that works well:
Hash key: DeviceId
With permissions I think I need to go to:
Hash key: ProviderId (only a handful of them)
Range key: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is if you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
Edit
Since you want to be able to list device IDs for a single provider, you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10 GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10 GB of data, or you end up with a hot partition, then consider concatenating an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs.
This approach is detailed in the Multi-Tenant SaaS Storage Strategies whitepaper.
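A rough sketch of that sharding idea, with hypothetical table and attribute names and a hypothetical shard count; note that reads fan out across all shards:

```python
import zlib
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Devices")  # hypothetical table
N_SHARDS = 4  # hypothetical shard count per provider

def shard_for(device_id: str) -> int:
    # Stable hash so a device always lands in the same shard.
    return zlib.crc32(device_id.encode()) % N_SHARDS

def put_device(provider_id: str, device_id: str) -> None:
    table.put_item(Item={
        "pk": f"{provider_id}#{shard_for(device_id)}",  # partition key
        "deviceId": device_id,                          # sort key
    })

def list_devices(provider_id: str) -> list:
    # Listing a provider's devices requires one query per shard.
    items = []
    for shard in range(N_SHARDS):
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"{provider_id}#{shard}")
        )
        items.extend(resp["Items"])
    return items
```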

Should I use a secondary index or separate ID lookup table in DynamoDB?

I'm migrating a database from MongoDB to DynamoDB and trying to understand best practices, especially around local secondary indexes and sort keys.
My application pulls in HTML data from the web and loads the data into several tables/collections. At the time of extraction it gives each item an extracted_id, unique to the website it's pulled from. Before loading the items, it gives each item a UUID as its primary/partition key.
Problem: in order to avoid assigning different UUIDs to the same extracted_id, I query the DB to check whether the entity has a preexisting entity_uuid.
Current Solution: Currently, in MongoDB, I have two sets of tables/collections: one for storing all items, and one serving as a lookup table from an entity's extracted_id (as key) to its entity_uuid (as value).
Better Solution?: As I move to DynamoDB, would it be better to create only one table, with extracted_id as a local secondary index, so as not to store duplicate data? I'm unsure, as the docs say to use indexes sparingly. I don't use the extracted_id for anything other than giving items their UUID for a given site.
Hopefully this makes sense, I'm new to AWS / DynamoDB and would appreciate any tips / better solutions to the ones mentioned.
Why not just make extracted_id the partition key of your new DynamoDB table and use a ConditionExpression attribute_not_exists(extracted_id) to prevent your application from writing duplicate entries?
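A minimal sketch of that idea, with hypothetical table and attribute names:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Items")  # hypothetical table

def put_once(extracted_id: str, item_uuid: str) -> bool:
    """Write the item only if this extracted_id has never been seen."""
    try:
        table.put_item(
            Item={"extracted_id": extracted_id, "uuid": item_uuid},
            ConditionExpression="attribute_not_exists(extracted_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate: an item with this extracted_id exists
        raise
```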