High-level overview with a simple integer order value to get my point across:
id (primary)  | order (sort) | attributes ...
--------------|--------------|----------------
ft8df34gfx    | 1            | ...
ft8df34gfx    | 2            | ...
ft8df34gfx    | 3            | ...
ft8df34gfx    | 4            | ...
ft8df34gfx    | 5            | ...
Usually it would be easy to change the order (e.g. if the user drags and drops list items on the front-end): shift the item around, calculate new order values, and update the affected items in the db with their new order.
Constraints:
The client doesn't have all the items at once, only a subset of them (think pagination)
Update only a single item in the db if a single item is moved (1 item per shift)
My initial idea:
Use the epoch as the order and append something unique to avoid duplicate epoch times, e.g. <epoch>#<something-unique-to-item>. The initial value is the insertion time (the default order is therefore newest first).
Client/server (whoever calculates order) knows the epoch for each item in subset of items it has.
If an item is shifted, look at the epochs of the previous and next items (if it has them; it could also be moved to the first or last position), pick a value between them, and update. More than one shift? Repeat the process.
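For illustration, a minimal sketch of that midpoint idea, simplified to plain integer order values as in the overview above (the names and the gap size are made up):

def order_between(prev_order, next_order):
    """Pick an order value strictly between two neighbours; None means the item
    was moved to the very front or very end of the list."""
    GAP = 1_000
    if prev_order is None:                       # moved before the first known item
        return next_order - GAP
    if next_order is None:                       # moved after the last known item
        return prev_order + GAP
    mid = (prev_order + next_order) // 2
    if mid in (prev_order, next_order):
        # no whole integer left between the neighbours -- the problem described below
        raise ValueError('order values exhausted, renumbering needed')
    return mid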
But..
If items are shifted enough times, the epoch values get closer and closer to each other until you can't find a middle ground with whole integers.
Add lots of zeroes to the epoch on insert? You still reach a limit at some point..
If an item is shifted to the first or last position and there are items on the previous or next page (remember, pagination), we don't know those values and can't reliably find a "value between".
Fetch 1 extra hidden item from the previous and next page? Querying gets complicated..
Is this even possible? What type/value should I use as the order?
DynamoDB does not allow the primary partition and sort keys to be changed for a particular item (to change them, the item would need to be deleted and recreated with the new key values), so you'll probably want to use a local or global secondary index instead.
Assuming the partition/sort keys you're mentioning are for a secondary index, I recommend storing natural numbers for the order (1, 2, 3, etc.) and then updating them as needed.
Effectively, you would have three cases to consider:
Adding a new item - You would perform a query on the secondary index's partition key with ScanIndexForward = false (to reverse the results order), with a projection on the "order" attribute, limited to 1 result. That will give you the maximum order value so far; the new item's order is just this maximum value + 1 (see the sketch after this list).
Removing an item - It may seem unsettling at first, but you can freely remove items without touching the orders of the other items. You may have some holes in your ordering sequence, but that's ok.
Changing the order - There's not really a way around it; your application logic will need to take the list of affected items and write all of their new orders to the table. If the items used to be (A, 1), (B, 2), (C, 3) and they get changed to A, C, B, you'll need to write to both B and C to update their orders accordingly so they end up as (A, 1), (C, 2), (B, 3).
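For the "adding a new item" case, here is a minimal boto3 sketch of that max-order query (table, index, and attribute names are assumptions, not from the question):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('Items')            # hypothetical table name

def next_order(list_id):
    """Return max('order') + 1 for one list by reading the secondary index in reverse."""
    resp = table.query(
        IndexName='list_id-order-index',                      # hypothetical GSI: list_id + order
        KeyConditionExpression=Key('list_id').eq(list_id),
        ScanIndexForward=False,    # reverse the results order, so the largest 'order' comes first
        Limit=1,
        ProjectionExpression='#o',
        ExpressionAttributeNames={'#o': 'order'},             # 'order' is a DynamoDB reserved word
    )
    items = resp.get('Items', [])
    return int(items[0]['order']) + 1 if items else 1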
An example: a table with a counter column and only 1 row. I want to update this row's counter value based on its current value.
Say, if counter is >= 10, set it to 0; otherwise counter++. How do I achieve this if/else clause in TransactWrite?
I can't have two actions in one transaction, because the documentation states that it does not allow more than 1 action on the same item.
And of course, the reason I use TransactWrite is that there will be multiple Lambdas doing this task in parallel.
You cannot do it in one request, transactional or otherwise.
You can get the item, decide what to do, and then update the item in a second request accordingly. You’ll want to keep a version number or timestamp attribute to make sure the item hasn’t changed between the read and write, and use a condition expression to fail if it has.
That’s a common idiom:
https://dynobase.dev/dynamodb-locking/
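A minimal sketch of that read-then-conditionally-write idiom using a version attribute (table and attribute names are hypothetical):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('Counters')          # hypothetical table name

def bump_counter(pk):
    """Read the item, decide, then write with a condition on the version we read."""
    item = table.get_item(Key={'id': pk})['Item']
    new_value = 0 if item['counter'] >= 10 else item['counter'] + 1
    try:
        table.update_item(
            Key={'id': pk},
            UpdateExpression='SET #c = :new, #v = #v + :one',
            ConditionExpression='#v = :seen',                  # fail if the item changed since the read
            ExpressionAttributeNames={'#c': 'counter', '#v': 'version'},
            ExpressionAttributeValues={':new': new_value, ':seen': item['version'], ':one': 1},
        )
    except ClientError as err:
        if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return bump_counter(pk)                            # lost the race, retry
        raise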
You can't do it the way you like, because if/else is not supported in transactions. There is however a simpler solution.
Just update the item and increment the counter. Whenever you read the value, take the counter value modulo 10, and you get the desired behavior, e.g. 123 % 10 = 3 or 10 % 10 = 0.
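A sketch of this simpler approach (same hypothetical Counters table as above): the write side is a single unconditional atomic increment, and the modulo is applied only when reading.

import boto3

table = boto3.resource('dynamodb').Table('Counters')   # hypothetical table name
pk = 'my-counter'                                       # hypothetical item id

# Write side: safe to run from many Lambdas in parallel, ADD is atomic.
table.update_item(
    Key={'id': pk},
    UpdateExpression='ADD #c :one',
    ExpressionAttributeNames={'#c': 'counter'},
    ExpressionAttributeValues={':one': 1},
)

# Read side: 123 % 10 = 3, 10 % 10 = 0.
raw = table.get_item(Key={'id': pk})['Item']['counter']
value = int(raw) % 10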
I am trying to scan a large table, and I was hoping to do it in chunks by only getting so many at a time, and then saving the LastEvaluatedKey so I could use it as my ExclusiveStartKey when I start up the scan again.
I have noticed that when I test on smaller tables, I may scan the entire table and get:
Key: A
Key: B
Key: C
Key: D
Key: E
Now, when I select key C as my ExclusiveStartKey, I would expect to get back D and E as I run through the rest of the table. However, I will sometimes get different keys. Is this expectation correct?
Something that might be causing problems is that my keys don't all start with the same letter: some start with a U and some start with an N. If I am using an ExclusiveStartKey that starts with a U, am I ignoring any that start with an N? I know ExclusiveStartKey aims for things greater than its value.
DynamoDB keys have two parts: the hash key and the sort key. As the names suggest, while the sort-key part is sorted (for strings, that's alphabetical order), the hash-key part is not sorted alphabetically. Instead, it is sorted by the value of the hash function, which means the order appears random although consistent: if you scan the same table twice and it didn't change, you should get back the keys in the same seemingly random order. ExclusiveStartKey can be used to start in the middle of this order, but it doesn't change the order.
In your example, if a Scan returned A, B, C, D, E in this order (note that, as I said, it usually will not be alphabetical order if you have hash keys!), then if you set ExclusiveStartKey to C you should definitely get D and E from the scan. I don't know how you saw something else - I suspect you did something wrong.
You mentioned the possibility of the table changing in parallel, and whether this has any effect on the result. Well, if according to the hash function a key X comes between C and D, and someone wrote to key X, it is indeed possible that your scan with ExclusiveStartKey=C would find X. However, since in your example we assume that A comes before C, a scan with ExclusiveStartKey=C can never return A: the scan looks for keys whose hash-function values are greater than C's, not for newly written data, so A doesn't match.
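For reference, a minimal boto3 sketch of chunked scanning, where the saved LastEvaluatedKey is fed back in as ExclusiveStartKey (table name is hypothetical):

import boto3

table = boto3.resource('dynamodb').Table('MyTable')    # hypothetical table name

def scan_in_chunks(chunk_size=100, start_key=None):
    """Yield items chunk by chunk; start_key may be a previously saved LastEvaluatedKey."""
    while True:
        kwargs = {'Limit': chunk_size}
        if start_key:
            kwargs['ExclusiveStartKey'] = start_key
        resp = table.scan(**kwargs)
        yield from resp['Items']
        start_key = resp.get('LastEvaluatedKey')
        if not start_key:                               # an absent key means the scan is complete
            break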
I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and order them according to the sum of all diffs with another list:
def diff(l1, l2):
    return sum([abs(v1 - v2) for v1, v2 in zip(l1, l2)])

list2 = [0.3, 0, 1, 0.5]
entries = list(Model.objects.all())  # materialize the queryset so it can be sorted in place
entries.sort(key=lambda t: diff(t.values, list2))
This works fast if my number of entries is very small. But I'm afraid that with a large number of entries, the comparison and sorting of all the entries will get slow, since they all have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over the list more than 4 times!
Although this approach looks pretty, it's not good.
One thing that you can do is:
have a variable called last_diff and set it to 0
iterate through all entries
iterate through each entry.values
from i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i])
sum these values in a variable called new_diff
if new_diff > last_diff, break from the inner loop and push the entry into its right place (this is called insertion sort, check it out!)
In this way, in the average case, the time complexity is much lower than what you are doing now!
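A rough sketch of just the early-break part of this idea (the insertion-sort bookkeeping is left out and the threshold handling is simplified):

def bounded_diff(values, list2, bound):
    """Sum |v1 - v2| pairwise, but give up once the running total exceeds 'bound'."""
    total = 0.0
    for v1, v2 in zip(values, list2):
        total += abs(v1 - v2)
        if total > bound:
            return None          # already worse than the candidates kept so far
    return total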
And maybe you need to be creative too. I'm going to share some ideas; check them for yourself to make sure they are fine.
Assuming that:
the values list elements are always positive floats,
list2 is always the same for all entries,
then you may be able to say that the bigger the sum of the elements in values, the bigger the diff value is going to be, no matter what the elements in list2 are.
Then you might be able to just forget about the whole diff function. (Test this!)
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to actually write a query for that in almost pure Django:
class Unnest(models.Func):
    function = 'UNNEST'

class Abs(models.Func):
    function = 'ABS'

class SubquerySum(models.Subquery):
    template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'

x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
    pairdiff=Abs(Unnest('values') - Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
    diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The UNNEST function turns each element of values into a row. In this case it happens twice, but the two resulting columns are instantly subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds: the column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as with zip, i.e. when one array is longer than the other, the remainder is ignored.
Further improvements
While it will already be a lot faster when you no longer have to retrieve all rows and loop over them in Python, this still results in a full table scan: all rows have to be processed, every single time. You can do better, though. Have a look into the cube extension. Use it to calculate the L1 distance (at least, that seems to be what you're calculating) directly with the <#> operator. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change the type from float[] to cube. In case of the latter, you might have to implement your own CubeField too; I haven't found any package yet that provides it. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed and hence blazing fast.
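A rough, untested sketch of what that could look like with RawSQL (it assumes the cube extension is installed, and that Model and its values field are the ones from the question):

from django.db import models
from django.db.models.expressions import RawSQL

x = [0.3, 0, 1, 0.5]
entries = Model.objects.annotate(
    # cube's <#> operator computes the taxicab (L1) distance
    diff=RawSQL('cube("values") <#> cube(%s::float8[])', (x,), output_field=models.FloatField()),
).order_by('diff')[:10]   # top-N on the lowest distance

With the GiST index on cube("values") described above, Postgres can serve such top-N distance orderings from the index instead of scanning the whole table.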
I am considering taking advantage of sparse indexes as described in the AWS guidelines. In the example described --
... in the GameScores table, certain players might have earned a particular achievement for a game - such as "Champ" - but most players have not. Rather than scanning the entire GameScores table for Champs, you could create a global secondary index with a partition key of Champ and a sort key of UserId.
My question is: what happens when the number of champs becomes very large? I suppose that the "Champ" partition will become very large and you would start to experience uneven load distribution. In order to get uniform load distribution, would I need to randomize the "Champ" value by (effectively) sharding over n shards, e.g. Champ.0, Champ.1 ... Champ.99?
Alternatively, is there a different access pattern that can be used when fetching entities with a specific attribute that may grow large over time?
This is exactly the solution you need (Champ.0, Champ.1 ... Champ.N).
N should be [the expected number of partitions for this index + some growth gap] (if you expect high load, or many 'champs', then you can choose N=200) for a good hash distribution over the partitions. I recommend deriving the shard suffix as userId modulo N (this can also help you do some manipulations by userId).
We also use this solution when the hash key is a Boolean (in DynamoDB you can represent a Boolean as a string); in that case the hash values will be "true.0", "true.1" ... "true.N", and the same for "false".
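A minimal sketch of that write-sharding pattern (table, index, and attribute names are assumptions; it also assumes numeric user ids for the modulo):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('GameScores')    # hypothetical table name
N = 200                                                    # number of shards

def champ_value(user_id):
    """Write side: the sparse attribute value, sharded by userId modulo N."""
    return f'Champ.{int(user_id) % N}'

def all_champs():
    """Read side: fan out one query per shard on the sparse GSI and merge the results."""
    for shard in range(N):
        resp = table.query(
            IndexName='champ-userId-index',                # hypothetical sparse GSI
            KeyConditionExpression=Key('champ').eq(f'Champ.{shard}'),
        )
        yield from resp['Items']                           # per-shard pagination omitted for brevity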
I need to keep data of the following form:
(a,b,1),
(c,d,2),
(e,f,3),
(g,h,4),
(i,j,5),
(k,l,6),
(m,a,7)
...
such that the integers within the data (3rd column) are consecutively ordered and unique. Also, there are 2,954,208,208 such rows. I am searching for a data structure which returns the value of the 3rd column given the values of the first two columns, e.g.
Given: (i,j) it returns 5
And given the value of the 3rd column, the first two columns can be retrieved. For example,
Given: 5 it returns (i,j)
Is there some data structure which may help me achieve this?
My approach towards solving this problem was to use hash-maps, but hash-maps do not turn out to be efficient. Is there some other way out?
The values in the first, second and third columns are all 64-bit.
I have 4GB of RAM.
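For scale, a rough sketch of the hash-map approach mentioned above, with a back-of-the-envelope estimate of why it cannot fit in 4GB of RAM:

rows = 2_954_208_208
print(rows * 3 * 8 / 2**30)     # three 64-bit values per row: ~66 GiB of raw data alone,
                                # before any hash-map overhead, so far beyond 4GB of RAM

forward = {}    # (col1, col2) -> col3
backward = {}   # col3 -> (col1, col2)

def insert(a, b, n):
    forward[(a, b)] = n
    backward[n] = (a, b)

insert('i', 'j', 5)
assert forward[('i', 'j')] == 5 and backward[5] == ('i', 'j')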