How to implement the division of two relations in MapReduce?

I want to implement the division of two relations in MapReduce. I have two relations, T(A,B,C) and U(B,C), and I know that R ÷ S = π_D(R) − π_D((π_D(R) × S) − R) for the relations R(A,B,D) and S(A,B). This is pretty much my scenario. I am not sure how I would go about implementing this in MapReduce. With my limited knowledge I'm guessing there would be 3 MapReduce jobs. I would assume the first round might be (π_A(T) × U) − T, where π_A projects away B and C.
Mapper 1: the input is either a tuple from T or a tuple from U.
If the tuple (a,b,c) belongs to T, then we emit key NULL and value ("T", a).
If the tuple (b,c) belongs to U, then we emit key NULL and value (b,c, "U").
Reducer 1: because every value shares the key NULL, they all meet at a single reducer, where we can perform the cartesian product of each ("T", a) with each (b,c, "U") and emit the new key NULL and value (a,b,c).
Reducer 2 (in the next job): we remove from the new cartesian tuples any that are in the original table T and emit only the tuples that are not contained in it.
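In Python-style pseudo-code, this is roughly how I picture those first rounds (just a sketch; the record format and tag names are made up):

def mapper1(record):
    # record = (relation, fields): ("T", (a, b, c)) or ("U", (b, c))
    relation, fields = record
    if relation == "T":
        a, b, c = fields
        yield None, ("T", a)                  # keep only the A attribute
    else:
        b, c = fields
        yield None, ("U", b, c)

def reducer1(key, values):
    # key is always None, so all values meet at one reducer
    a_values = {v[1] for v in values if v[0] == "T"}
    bc_pairs = [(v[1], v[2]) for v in values if v[0] == "U"]
    for a in a_values:                        # cartesian product: pi_A(T) x U
        for b, c in bc_pairs:
            yield (a, b, c), "CAND"

# Second job: its mapper would emit each candidate as ((a, b, c), "CAND") and
# each original T tuple as ((a, b, c), "T"); the reducer keeps what is not in T.
def reducer2(key, tags):
    if "T" not in list(tags):                 # (pi_A(T) x U) - T
        yield key, None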
I am confused about what I would do next: would it be another mapper, or could I again use a reducer for the next projection that removes B and C? I'm also not sure if I did the first round correctly. If anyone can tell me the steps, preferably in pseudo-code, that would help me understand. I haven't found any answers for this online.

Related

ExclusiveStartKey changes LastEvaluatedKey

I am trying to scan a large table, and I was hoping to do it in chunks by fetching only so many items at a time and saving the LastEvaluatedKey, so I could use it as my ExclusiveStartKey when I start the scan up again.
I have noticed that when I test on smaller tables, I may scan the entire table and get:
Key: A
Key: B
Key: C
Key: D
Key: E
Now, when I select key C as my ExclusiveStartKey, I would expect to get back D and E as I run through the rest of the table. However, I will sometimes get different keys. Is this expectation correct?
Something that might be causing problems is that my keys do not all start with the same letter: some start with a U and some start with an N. If I am using an ExclusiveStartKey that starts with a U, am I ignoring any key that starts with an N? I know ExclusiveStartKey aims for things greater than its value.
DynamoDB keys have two parts: the hash key and the sort key. As the names suggest, while the sort-key part is sorted (for strings, that's alphabetical order), the hash-key part is not sorted alphabetically. Instead, it is ordered by the keys' hash-function values, which means the order appears random, although it is consistent: if you scan the same table twice and it didn't change, you should get back the keys in the same seemingly random order. ExclusiveStartKey can be used to start in the middle of this order, but it shouldn't change the order.
In your example, if a Scan returned A, B, C, D, E in this order (note that, as I said, it usually will not be alphabetical order if you have hash keys!), then if you set ExclusiveStartKey to C you should definitely get D and E from the scan. I don't know how you saw something else; I suspect you did something wrong.
You mentioned the possibility of the table changing in parallel, and whether this has any effect on the result. Well, if according to the hash function a key X comes between C and D, and someone wrote to key X, it is indeed possible that your scan with ExclusiveStartKey=C would find X. However, since in your example we assume that A comes before C, a scan with ExclusiveStartKey=C can never return A: the scan looks for keys whose positions in the hash order come after C's, not for newly written data, so A doesn't match.
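For reference, the usual chunked-scan pattern looks something like this (a boto3 sketch; the table name is made up):

import boto3

table = boto3.resource("dynamodb").Table("my-table")

scan_kwargs = {"Limit": 100}            # fetch at most 100 items per request
while True:
    response = table.scan(**scan_kwargs)
    for item in response["Items"]:
        print(item)
    last_key = response.get("LastEvaluatedKey")
    if last_key is None:                # no more pages
        break
    # resume exactly where the previous page stopped
    scan_kwargs["ExclusiveStartKey"] = last_key

As long as the table doesn't change between requests, each page simply continues the same hash-ordered traversal.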

Use case for "sets of tuple data" in Pyomo

When we specify the data for a set we have the ability to give it tuples of data. For example, we could write in our .dat file the following:
set A : 1 2 3 :=
1 + - -
2 - - +
3 - + + ;
This would specify that we would have 4 tuples in our set: (1,1), (2,3), (3,2), (3,3)
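(For that .dat table to load, I believe the model declares the set with dimension 2; a minimal sketch:)

import pyomo.environ as pyo

model = pyo.AbstractModel()
model.A = pyo.Set(dimen=2)   # the +/- matrix in the .dat file populates this 2-tuple set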
But I am struggling to understand exactly why we would want to do this. Furthermore, suppose we instantiated a Set object in our code as:
model.Aset = RangeSet(4, dimen=2)
Would this then specify that our tuples would have the indices 1, 2, 3, and 4?
I am thinking that specifying tuples in our set could potentially be useful when working with some data in which it's important to have a bit of a "spatial" understanding of the problem. But I would be curious to hear from the community what the potential applications of specifying set data this way might be.
The most common place this appears is when you're trying to model edges between nodes in a network. Networks aren't usually completely dense (have edges between every pair of nodes) so it's beneficial to represent just the edges that appear using a sparse set of tuples.
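As a small illustration (a sketch; the node and edge data are made up), a sparse edge set might look like:

import pyomo.environ as pyo

model = pyo.ConcreteModel()
model.Nodes = pyo.Set(initialize=[1, 2, 3])
# only the edges that actually exist, as 2-tuples, not the full 3x3 product
model.Edges = pyo.Set(within=model.Nodes * model.Nodes,
                      initialize=[(1, 2), (2, 3), (3, 1)])
# a variable indexed only over the existing edges
model.flow = pyo.Var(model.Edges, domain=pyo.NonNegativeReals)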

Ordering by sum of difference

I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and ordering them according to the sum of all diffs with another list:
def diff(l1, l2):
    return sum(abs(v1 - v2) for v1, v2 in zip(l1, l2))
list2 = [0.3, 0, 1, 0.5]
entries = list(Model.objects.all())
entries.sort(key=lambda t: diff(t.values, list2))
This works fine if my number of entries is very small. But I'm afraid that with a large number of entries, the comparison and sorting will get slow, since they all have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over each list several times!
Although this approach looks pretty, it's not good.
One thing that you can do is:
have a variable called last_diff and set it to 0
iterate through all entries
iterate through each entry.values
from i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i])
sum these values into a variable called new_diff
if new_diff > last_diff, break from the inner loop and push the entry into its right place (this is called insertion sort, check it out!)
This way, in the average scenario, the time complexity is much lower than what you are doing now! A sketch of the idea follows below.
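For instance, if what you ultimately need is only the N closest entries, the early break can be made concrete like this (a rough, untested sketch; closest_n and the cutoff logic are my own illustration, not code from the question):

import bisect

def closest_n(entries, list2, n):
    best = []                            # sorted list of (diff, index, entry)
    for i, entry in enumerate(entries):
        # the worst diff we would still keep; infinite until we have n entries
        cutoff = best[-1][0] if len(best) == n else float("inf")
        total = 0.0
        for v1, v2 in zip(entry.values, list2):
            total += abs(v1 - v2)
            if total > cutoff:           # already worse than the cutoff: bail out
                break
        else:
            bisect.insort(best, (total, i, entry))
            del best[n:]                 # trim back to the n best
    return [entry for _, _, entry in best]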
Maybe you can be creative, too. I'm going to share an idea; check it for yourself to make sure it's sound.
Assuming that:
the values list elements are always positive floats,
list2 is always the same for all entries,
then you may be able to say that the bigger the sum over the elements in values, the bigger the diff value is going to be, no matter what the elements of list2 are.
Then you might be able to just forget about the whole diff function. (Test this!)
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to write a query for that in almost pure Django:
from django.contrib.postgres.fields import ArrayField
from django.db import models

class Unnest(models.Func):
    function = 'UNNEST'

class Abs(models.Func):
    function = 'ABS'

class SubquerySum(models.Subquery):
    template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'

x = [0.3, 0, 1, 0.5]
# one row per array element, holding the absolute pairwise difference
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
    pairdiff=Abs(Unnest('values') - Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
    diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The UNNEST function turns each element of values into a row of its own. In this case it happens twice, but the two resulting columns are immediately subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds: the column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as zip, i.e. when one array is longer than the other, the remainder is ignored.
Further improvements
While it will already be a lot faster when you don't have to retrieve all rows and loop over them in Python, it doesn't change the fact that this is a full table scan: all rows have to be processed, every single time. You can do better, though. Have a look at the cube extension. Use it to calculate the L1 distance (at least, that seems to be what you're calculating) directly with the <#> operator, as sketched below. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change its type from float[] to cube. In the case of the latter, you might have to implement your own CubeField too; I haven't found a package yet that provides one. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed and hence blazing fast.
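To give an idea of the shape (untested, and assuming the cube extension is installed; the alias l1 is arbitrary):

from django.db.models.expressions import RawSQL

x = [0.3, 0, 1, 0.5]
entries = Model.objects.annotate(
    # <#> is the cube extension's taxicab (L1) distance operator
    l1=RawSQL('cube("values") <#> cube(%s::float8[])', (x,))
).order_by('l1')[:10]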

Scala List filter by date difference

I've got a problem and I don't really know how to solve it in a proper Scala way.
I've got a list of objects, each holding a date. I want to do something like this:
I want to make a selection using an acceptable time gap, like 2 hours, between 2 successive elements in the list. The purpose is to track a user's trend relative to a point (whether he shows up 2 times here, or 1, or 15!).
The algorithm I imagined:
Let's take 2 points, A and B. We calculate the time difference between the 2 points and then evaluate whether it's acceptable or not (>2h is acceptable).
If it's not acceptable, we reject B and then new B is the next list element.
If acceptable, B becomes A and the new B is the next list element.
How do I do it, with some filters or collects? Oh, and if the algorithm doesn't sound good to you, I'm open to criticism!
Edit: I'm not asking for the solution, just the right functions to look up!
Say I have a list of integers, and I want to step through them and only keep the ones which are more than 1 greater than the last one kept. I would use foldLeft to step through them, building a list of only the items which are acceptable:
val nums = List(1, 2, 4, 5, 7)
val kept = nums.foldLeft(List[Int]()) {
  // the first element is always kept
  case (Nil, b) => List(b)
  // keep b only if it is more than 1 greater than the last kept element
  case (last :: rest, b) if b - last > 1 => b :: last :: rest
  case (list, _) => list
}.reverse
// kept == List(1, 4, 7)

Haskell List Monad State Dependance

I have to write a program in Haskell that will solve a nondeterministic problem.
I think I understand the list monad about 75%, so it is the obvious choice, but...
(My problem is filling an n x m board with ships and water. I am given the sums of the rows and columns, every part of a ship has its value, etc.; it's not important right now.)
I want to guard as early as possible to make the algorithm efficient. The problem is that whether a ship can be inserted depends on what I am given and what I have inserted in previous moves. Let's call it the board state, and I have no idea how to pass it along, because I can't generate a new state from the board alone.
My algorithm is:
1. Initialize the first board.
2. Generate the first row, trying every possible insertion (I can insert a ship vertically, so I need to remember to insert the other parts of the ship in lower rows).
3. Solve the problem for the smaller board (of course, after generating every 2 rows I check that everything is OK).
But I have no idea how I can pass the new states, because as far as I have read about the State monad, it generates a new state from the old state alone, and that is impossible for me; I would want to generate a new state while doing operations on the value.
I am sorry for my hatred towards Haskell, but after a few years of programming in imperative languages, being forced to fight with these monads to do things which in other languages I could write almost instantly makes me mad. (Other things in Haskell are fine for me, and some of them are actually quite nice.)
Combine StateT with the list monad to get your desired behavior.
Here's a simple example of using the non-determinism of the list monad while still keeping a history of previous choices made:
import Control.Monad
import Control.Monad.Trans.Class
import Control.Monad.Trans.State
fill :: StateT [Int] [] [Int]
fill = do
  history <- get
  if length history == 3
    then return history                 -- the board is full: succeed
    else do
      choice <- lift [0, 1, 2]          -- non-deterministically pick a value
      guard (choice `notElem` history)  -- prune paths that repeat a value
      put (choice : history)
      fill
fill maintains a separate history for each path that it tries out. If it fills up the board it returns successfully, but if the current choice overlaps with a previous choice it abandons that solution and tries a different path.
You run it using evalStateT, supplying an initial empty history:
>>> evalStateT fill []
[[2,1,0],[1,2,0],[2,0,1],[0,2,1],[1,0,2],[0,1,2]]
It returns a list of all possible solutions. In this case, that just happens to be the list of all permutations in which we could have filled up the board.