I am trying to scan a large table, and I was hoping to do it in chunks by fetching only so many items at a time, saving the LastEvaluatedKey so I can use it as my ExclusiveStartKey when I start the scan up again.
I have noticed that when I test on smaller tables, I may scan the entire table and get:
Key: A
Key: B
Key: C
Key: D
Key: E
Now, when I select key C as my ExclusiveStartKey, I would expect to get back D and E as I run through the rest of the table. However, I will sometimes get different keys. Is this expectation correct?
Something that might be causing problems is that my keys don't all start with the same letter: some start with a U and some start with an N. If I use an ExclusiveStartKey that starts with a U, am I ignoring any key that starts with an N? I know ExclusiveStartKey aims for things greater than its value.
DynamoDB keys have two parts: the hash key and the sort key. As the names suggest, while the sort-key part is sorted (for strings, that's alphabetical order), the hash-key part is not sorted alphabetically. Instead, it is ordered by the hash of the value, which means the order appears random although it is consistent: if you scan the same table twice and it didn't change, you should get back the keys in the same seemingly random order. ExclusiveStartKey can be used to start in the middle of this order, but it doesn't change the order.
In your example, if a Scan returned A, B, C, D, E in this order (note that, as I said, it usually will not be alphabetical order if you have hash keys!), then setting ExclusiveStartKey to C should definitely give you D and E for the rest of the scan. I don't know how you saw something else - I suspect you did something wrong.
You mentioned the possibility of the table changing in parallel, and whether this has any effect on the result. Well, if according to the hash function a key X comes between C and D, and someone wrote to key X, it is indeed possible that your scan with ExclusiveStartKey=C would find X. However, since in your example we assume that A comes before C, a scan with ExclusiveStartKey=C can never return A: the scan looks for keys whose hash values come after C's, not for newly written data, so A doesn't match.
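For reference, here is a minimal boto3 sketch of that kind of chunked scan (the table name and page size are placeholders, not from the question). The key point is that LastEvaluatedKey from one response is fed back in as ExclusiveStartKey, and its absence signals the end of the table:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MyTable")  # hypothetical table name

def scan_in_chunks(page_size=100):
    start_key = None
    while True:
        kwargs = {"Limit": page_size}
        if start_key is not None:
            kwargs["ExclusiveStartKey"] = start_key
        response = table.scan(**kwargs)
        for item in response["Items"]:
            yield item
        # LastEvaluatedKey is absent once the scan has covered the whole table;
        # persist it between runs and pass it back in as ExclusiveStartKey.
        start_key = response.get("LastEvaluatedKey")
        if start_key is None:
            break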
High-level overview with a simple integer order value to get my point across:
id (primary) | order (sort) | attributes ..
-------------|--------------|---------------
ft8df34gfx   | 1            | ...
ft8df34gfx   | 2            | ...
ft8df34gfx   | 3            | ...
ft8df34gfx   | 4            | ...
ft8df34gfx   | 5            | ...
Usually it would be easy to change the order (e.g. if the user drags and drops list items on the front end): shift the item around, calculate new order values and update the affected items in the db with their new order.
Constraints:
We don't have all the items at once, only a subset of them (think pagination)
Update only a single item in the db if a single item is moved (1 item per shift)
My initial idea:
Use epoch as the order and append something unique to avoid duplicate epoch times, e.g. <epoch>#<something-unique-to-item>. The initial value is the insertion time (so the default order is newest first).
The client/server (whoever calculates the order) knows the epoch for each item in the subset of items it has.
If an item is shifted, look at the epochs of the previous and next items (if it has them; it could be moved to first or last), pick a value in between and update. More than one shift? Repeat the process.
But..
If items are shifted enough times, epoch values get closer and closer to each other until you can't find a middle ground with whole integers.
Add lots of zeroes to the epoch on insert? You still reach the limit at some point..
If an item is shifted to first or last and there are items on the previous or next page (remember, pagination), we don't know those values and can't reliably find a "value in between".
Fetch one extra hidden item from the previous and next page? Querying gets complicated..
Is this even possible? What type/value should I use as order?
DynamoDB does not allow the primary partition and sort keys to be changed for a particular item (to change them, the item would need to be deleted and recreated with the new key values), so you'll probably want to use a local or global secondary index instead.
Assuming the partition/sort keys you're mentioning are for a secondary index, I recommend storing natural numbers for the order (1, 2, 3, etc.) and then updating them as needed.
Effectively, you would have three cases to consider:
Adding a new item - You would perform a query on the secondary index's partition key with ScanIndexForward = false (to reverse the results order), with a projection on the "order" attribute, limited to 1 result. That will give you the maximum order value so far. The new item's order will just be this maximum value + 1 (see the sketch after this list).
Removing an item - It may seem unsettling at first, but you can freely remove items without touching the orders of the other items. You may have some holes in your ordering sequence, but that's ok.
Changing the order - There's not really a way around it; your application logic will need to take the list of affected items and write all of their new orders to the table. If the items used to be (A, 1), (B, 2), (C, 3) and they get changed to A, C, B, you'll need to write to both B and C to update their orders accordingly so they end up as (A, 1), (C, 2), (B, 3).
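As a rough illustration of the "adding a new item" case, here is what that query might look like with boto3. The table name, index name and attribute names (list_id, item_order) are made up for the example; note that "order" itself is a DynamoDB reserved word, which is one reason to pick a different attribute name:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MyTable")  # hypothetical table name

def next_order(list_id):
    # Query the (hypothetical) secondary index whose partition key is list_id
    # and whose sort key is item_order, highest order first.
    response = table.query(
        IndexName="order-index",                   # hypothetical index name
        KeyConditionExpression=Key("list_id").eq(list_id),
        ProjectionExpression="item_order",
        ScanIndexForward=False,                    # reverse the results order
        Limit=1,
    )
    items = response["Items"]
    max_order = items[0]["item_order"] if items else 0
    return max_order + 1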
I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and ordering them according to the sum of all diffs with another list:
def diff(l1, l2):
    return sum([abs(v1-v2) for v1, v2 in zip(l1, l2)])

list2 = [0.3, 0, 1, 0.5]
entries = list(Model.objects.all())
entries.sort(key=lambda t: diff(t.values, list2))
This works fast if my number of entries is small. But I'm afraid that with a large number of entries, the comparison and sorting of all the entries will get slow, since they all have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over the list more than four times! Although this approach looks pretty, it's not efficient.
One thing that you can do is:
have a variable called last_diff and set it to 0
iterate through all entries
iterate through each entry.values
from i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i])
sum these values up in a variable called new_diff
if new_diff > last_diff, break out of the inner loop and push the entry into its right place (this is called insertion sort, check it out!)
This way, in the average scenario, the time complexity is much lower than what you are doing now (a rough sketch of the early-exit idea follows).
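Note that a partially computed sum is only safe to compare against the largest diff you have decided to keep, so the early-exit idea is easiest to apply when you only need the N entries with the smallest diffs rather than a full sort. A hedged sketch of that variant (the names diff_with_cutoff and n_closest are made up for illustration):

import heapq

def diff_with_cutoff(values, ref, cutoff):
    # Running sum of absolute pairwise differences; stop as soon as the total
    # exceeds `cutoff`, because the exact value no longer matters at that point.
    total = 0.0
    for v, r in zip(values, ref):
        total += abs(v - r)
        if total > cutoff:
            break
    return total

def n_closest(entries, ref, n=10):
    # Max-heap (negated diffs) holding the n smallest diffs seen so far.
    heap = []
    for idx, entry in enumerate(entries):
        cutoff = -heap[0][0] if len(heap) == n else float("inf")
        d = diff_with_cutoff(entry.values, ref, cutoff)
        if d >= cutoff:
            continue                 # early exit hit, or simply not among the n closest
        item = (-d, idx, entry)      # idx breaks ties so entries are never compared
        if len(heap) == n:
            heapq.heapreplace(heap, item)
        else:
            heapq.heappush(heap, item)
    # Return entries ordered from smallest to largest diff.
    return [e for _, _, e in sorted(heap, reverse=True)]

# e.g. closest = n_closest(Model.objects.all(), list2, n=20)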
And maybe you can be creative too. I'm going to share some ideas; check them for yourself to make sure that they hold up.
assuming that:
values list elements are always positive floats.
list2 is always the same for all entries.
then you may be able to say that the bigger the sum over the elements in values, the bigger the diff value is going to be, no matter what the elements of list2 are.
Then you might be able to just forget about the whole diff function (test this!).
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to actually write a query for that in almost pure Django:
class Unnest(models.Func):
    function = 'UNNEST'

class Abs(models.Func):
    function = 'ABS'

class SubquerySum(models.Subquery):
    template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'

x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
    pairdiff=Abs(Unnest('values')-Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
    diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The unnest function turns each element of the values array into a row. In this case it happens twice, but the two resulting columns are instantly subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds. The column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as with zip, i.e. when one array is longer than the other, the remainder is ignored.
Further improvements
While it will already be a lot faster when you don't have to retrieve all rows and loop over them in Python, it doesn't change the fact that this is still a full table scan: all rows have to be processed, every single time. You can do better, though. Have a look into the cube extension. Use it to calculate the L1 distance (at least, that seems to be what you're calculating) directly with the <#> operator. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change the type from float[] to cube. In case of the latter, you might have to implement your own CubeField too; I haven't found any package yet that provides it. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed and hence blazing fast.
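An untested sketch of what that could look like, assuming the cube extension is installed and a GiST expression index has been created roughly as in the comment below (index and table names are made up); <#> is the cube extension's taxicab (L1) distance operator:

from django.db.models.expressions import RawSQL

# Assumes, roughly:
#   CREATE EXTENSION cube;
#   CREATE INDEX model_values_cube_idx ON <model_table> USING gist (cube("values"));
x = [0.3, 0, 1, 0.5]
closest = (
    Model.objects
    .annotate(dist=RawSQL('cube("values") <#> cube(%s::float8[])', (x,)))
    .order_by('dist')[:10]
)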
Let's say the database 'testdb' has a collection 'testcollection'. In this collection, there is a key 'testfield'. This collection holds billions of documents.
Finding an exact value of the key 'testfield' can be done very fast, even though there are billions of documents, because the index is based on a radix tree (possibly?).
model.find({
  testfield: "some-value"
})
However, when finding in a range of values, is it still fast against these billions of documents?
model.find({
  testfield: {
    $gte: "some-lower-value",
    $lte: "some-upper-value"
  }
})
Mats Peterson in the comment below the question:
All depends on the exact key-index sorting. Which in turn depends on settings and configurations when constructing the database (can be changed after the fact too). If it's a sorted list of testfield values, then it should be relatively quick [assuming upper and lower limits are relatively close, obviously if you have a wide range, a wide part of the db gets sent your way, and that could take some time]
The time for finding a range is fast too, but it depends on the size of the range.
The lower and upper values can each be found in ~log(N), just like finding an exact value. Since the index is a sorted list of the key's values, the remaining time depends on the loop that extracts all entries from the lower value to the upper value.
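A small pymongo sketch of the same queries (the connection string is a placeholder; the database, collection and field names are taken from the question). A single-field index keeps testfield values in sorted order, so both the exact match and the range query can use it, and explain() should report an index scan rather than a collection scan:

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
coll = client["testdb"]["testcollection"]

# Single-field index on testfield (values kept in sorted order).
coll.create_index([("testfield", ASCENDING)])

exact = coll.find({"testfield": "some-value"})
ranged = coll.find({
    "testfield": {"$gte": "some-lower-value", "$lte": "some-upper-value"}
})
# ranged.explain() should show an IXSCAN stage, not COLLSCAN.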
I need to keep data of the following form:
(a,b,1),
(c,d,2),
(e,f,3),
(g,h,4),
(i,j,5),
(k,l,6),
(m,a,7)
...
such that the integers within the data (3rd column) are consecutively ordered and unique. There are 2,954,208,208 such rows. I am searching for a data structure which returns the value of the 3rd column given the values of the first two columns, e.g.
Given: (i,j) it returns 5
And given the value of the 3rd column, the first two columns can be retrieved. For example,
Given: 5 it returns (i,j)
Is there some data structure which may help me achieve this?
My approach to solving this problem was to use hash maps, but hash maps do not turn out to be efficient. Is there some other way out?
The values in the first, second and third columns are all 64-bit.
I have 4 GB of RAM.
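For concreteness, the two-hash-map approach mentioned above might look like the toy sketch below (the sample rows are made up). It also shows why it is problematic here: at roughly 2.95 billion rows of three 64-bit values each, the raw data alone is on the order of 70 GB, far beyond 4 GB of RAM, before any per-entry dict overhead:

# Toy illustration of the two-map idea from the question.
rows = [("a", "b", 1), ("c", "d", 2), ("e", "f", 3), ("i", "j", 5)]

pair_to_id = {(a, b): n for a, b, n in rows}   # (col1, col2) -> col3
id_to_pair = {n: (a, b) for a, b, n in rows}   # col3 -> (col1, col2)

assert pair_to_id[("i", "j")] == 5
assert id_to_pair[2] == ("c", "d")

# ~2,954,208,208 rows * 3 * 8 bytes is roughly 71 GB of raw data, so plain
# in-memory dicts cannot work with 4 GB of RAM.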
MapReduce basic information for passing and emitting key-value pairs.
I need a little bit of clarity on what we pass in and what gets emitted.
Here are my concerns:
MapReduce Input and Output:
1. Map() method - does it take a single key-value pair or a list of them, and what does it emit?
2. For each input key-value pair, what do mappers emit? The same type or a different type?
3. For each intermediate key, what will the reducer emit? Is there any restriction on type?
4. The reducer receives all values associated with the same key. How will the values be ordered - sorted or arbitrarily ordered? Does that order vary from run to run?
5. During the shuffle and sort phase, in which order are keys and values presented?
For each input (k1, v1) the map emits zero or more (k2, v2) pairs.
For each k2 the reducer receives (k2, list(v2, v2', v2'', ...)).
For each input (k2, list(v2)) the reducer can emit zero or more (k3, v3) pairs.
Values are arbitrarily ordered in step 2.
The key and value outputs of the mapper and of the reducer should each be of a consistent type, i.e. all keys must be of the same type and all values must be of the same type.
Map method: it receives (K1, V1) as input and returns (K2, V2). That is, the output key and value can be of a different type from the input key and value.
Reducer method: after the output of the mappers has been shuffled correctly (same key goes to the same reducer), the reducer input is (K2, LIST(V2)) and its output is (K3,V3).
As a result of the shuffling process, the keys arrive at the reducer sorted by the key K2.
If you want to order the keys in your own particular manner, you can implement the compareTo method of the key K2.
Referring to your questions:
1. Answered above.
2. You can emit whatever you want as long as it consists of a key and a value.
For example, in WordCount you send the word as the key and 1 as the value.
3. In the WordCount example, the reducer will receive a word and a list of numbers.
Then it will sum up the numbers and emit the word and its sum (a small Python sketch of this appears at the end).
4. Answered above.
5. Answered above.
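For the WordCount example mentioned above, a rough Hadoop Streaming style sketch in Python (not the classic Java version) of the mapper and reducer might look like this; the mapper emits (word, 1) pairs and the reducer, which receives lines already sorted by key, sums them per word:

import sys
from itertools import groupby

def mapper(lines):
    # Each input line is a record; emit zero or more (word, 1) pairs.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Input arrives grouped and sorted by key: (word, [1, 1, ...]) -> (word, sum).
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    # With Hadoop Streaming, point -mapper at "wordcount.py map" and
    # -reducer at "wordcount.py reduce" (the script name is hypothetical).
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)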