Do I need to sort after Group By, before a Merge Join - Kettle

I need to use Group By and Merge Join in PDI (Kettle). Both use the same field as the key.
I could not find anything confirming that the data remains ordered after the Group By.
So I need to know which of these is correct:
SORT > GROUPBY > SORT > MERGEJOIN
or
SORT > GROUPBY > MERGEJOIN
Could someone tell me which is correct, and why?
Thank you very much.

You need to sort BEFORE both the Group By and the Merge Join, on the keys you're grouping or joining on. The data leaves a step in the same order it entered, so if you group and then merge on the same keys, you don't need another sort between the Group By and the Merge Join.
If the keys change, however, you do.
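The same contract shows up in Python's itertools.groupby, which is a reasonable analogy for PDI's Group By here (just a sketch, not Kettle itself): it assumes sorted input and emits groups in the input order, so a downstream merge on the same key needs no extra sort.

```python
from itertools import groupby
from operator import itemgetter

rows = [("b", 2), ("a", 1), ("a", 3), ("c", 5)]

# Sort first: groupby, like PDI's Group By, only merges *adjacent* rows,
# so unsorted input would split "a" into two separate groups.
rows.sort(key=itemgetter(0))

# Aggregate per key; the keys come out in the same sorted order they went
# in, which is why no second sort is needed before a merge join on them.
grouped = [(k, sum(v for _, v in g)) for k, g in groupby(rows, key=itemgetter(0))]
print(grouped)  # -> [('a', 4), ('b', 2), ('c', 5)]
```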

Related

Why is an AWS DynamoDB query result a list of items when the partition key is a unique value?

I'm new to AWS DynamoDB and wanted to clarify something.
As I learned, when we use Query we should use the partition key, which is unique among the items. So how can the result of a Query be a list? Shouldn't it be a single row? And why would we even need to add more conditions?
I think I am missing something here; can someone please help me understand?
I need this because I want to query for applications with a specific status value, or within a specific time range, but if I am supposed to provide the appId, what is the point of the extra condition?
Thank you in advance.
Often your table will have a sort key, which together with your partition key forms a composite primary key. In that scenario a query can return multiple items. To return a single item rather than a list, you can use get_item instead, provided you know the full, unique value of the composite primary key.
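As a rough illustration, here is a toy model in plain Python (not the real boto3 API; the appId/createdAt attribute names are made up): a query on the partition key alone matches every item sharing that key, while get_item with the full composite key pins down at most one item.

```python
# Toy model of a DynamoDB table with a composite primary key:
# partition key = appId, sort key = createdAt.
table = {
    ("app-1", "2023-01-01"): {"status": "RUNNING"},
    ("app-1", "2023-01-02"): {"status": "FAILED"},
    ("app-2", "2023-01-01"): {"status": "RUNNING"},
}

def query(partition_key):
    """Like Query with only a partition-key condition: returns a LIST,
    one entry per sort-key value under that partition key."""
    return [item for (pk, sk), item in sorted(table.items()) if pk == partition_key]

def get_item(partition_key, sort_key):
    """Like GetItem: the full composite key identifies at most one item."""
    return table.get((partition_key, sort_key))

print(len(query("app-1")))              # 2 items share the partition key
print(get_item("app-1", "2023-01-02"))  # {'status': 'FAILED'}
```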

How to run a 'greater than' query in Amazon DynamoDB?

I have a primary key in the table called 'OrderID'; it is numeric and increments for every new item. An example table would look like this:
Let's assume that I want to get all orders above the OrderID '1002'. How would I do that?
Is there any possibility of doing this with DynamoDB Query?
Any help is appreciated :)
Thanks!
Unfortunately, with this base table you cannot run a Query with a greater-than condition on the partition key.
You have 3 choices:
Migrate to using Scan; this will consume your read capacity significantly.
Create a secondary index: you'd want a global secondary index with your order ID as the sort key. Take a look here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html#GSI.OnlineOps.Creating.
Loop in the application, performing a Query or GetItem request from an initial value until there are no results left (very inefficient).
The best practice would be to use the GSI if you can, as this will be the most performant option.
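To sketch the GSI option (again a toy model in plain Python, not the real DynamoDB API): a sort-key condition such as OrderID > 1002 only works within a single partition, so this assumes a constant attribute, here called "orderBucket", was added to serve as the GSI partition key. All names are hypothetical.

```python
# Toy sketch of a global secondary index whose sort key is OrderID.
orders = [
    {"OrderID": 1001, "orderBucket": "ALL", "item": "pen"},
    {"OrderID": 1002, "orderBucket": "ALL", "item": "ink"},
    {"OrderID": 1003, "orderBucket": "ALL", "item": "pad"},
    {"OrderID": 1004, "orderBucket": "ALL", "item": "cap"},
]

def query_gsi(bucket, order_id_gt):
    """Like a Query on the GSI with a key condition of the form
    'orderBucket = :b AND OrderID > :oid'; results come back
    ordered by the sort key."""
    hits = [o for o in orders
            if o["orderBucket"] == bucket and o["OrderID"] > order_id_gt]
    return sorted(hits, key=lambda o: o["OrderID"])

result = query_gsi("ALL", 1002)
print([o["OrderID"] for o in result])  # -> [1003, 1004]
```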

Redshift: Does key-based distribution optimize equality filters?

This documentation describes key distribution in Redshift as follows:
The rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.
I was wondering if key-distribution additionally helps in optimizing equality filters. My intuition says it should but it isn't mentioned anywhere.
Also, I saw documentation regarding sort keys which says that to select a sort key you should:
Look for columns that are used in range filters and equality filters.
This got me confused, since sort keys are explicitly mentioned as the way to optimize equality filters.
I am asking because I already have a candidate sort key on which I will be doing range queries, but I also want fast equality filters on another column, which happens to be a good distribution key in my case.
It is a very bad idea to be filtering on a distribution key, especially if your table / cluster is large.
The reason is that the filter may be running on just one slice, in effect running without the benefit of MPP.
For example, if you have a dist key of "added_date", you may find that all of the rows added in the previous week sit together on one slice.
You will then have the majority of queries filtering for recent ranges of added_date, and these queries will be concentrated and will saturate that one slice.
The simple rule is:
Use DISTKEY for the column most commonly joined
Use SORTKEY for the fields most commonly used in WHERE clauses
There actually are benefits to using the same field for SORTKEY and DISTKEY. From Choose the Best Sort Key:
If you frequently join a table, specify the join column as both the sort key and the distribution key.
This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
Feel free to do some performance tests -- create a few different versions of the table, and use INSERT or SELECT INTO to populate them. Then, try common queries to see how they perform.
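For such a test, the table variants might be declared along these lines (a hypothetical orders table; the names and column types are made up for illustration):

```sql
-- Variant 1: distribute on the common join column, sort on the filter column
CREATE TABLE orders_v1 (
    order_id    BIGINT,
    customer_id BIGINT,
    added_date  DATE
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (added_date);

-- Variant 2: the same column as both DISTKEY and SORTKEY, which can
-- enable a sort merge join on customer_id
CREATE TABLE orders_v2 (
    order_id    BIGINT,
    customer_id BIGINT,
    added_date  DATE
)
DISTKEY (customer_id)
SORTKEY (customer_id);
```

Populate each variant with the same data, then compare EXPLAIN plans and timings for your common joins and filters.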

Scalding operate on group after groupBy

I am writing a scalding job.
Here's what I want to do: first groupBy a key. That should give me a (Key, Iterator[Value]) pair for each key (correct me if I'm wrong here). Then, for each key, I want to apply a function to its associated Iterator[Value].
How should I do this? I am currently using a groupBy followed by a mapGroup. I get an Iterator[Value], but that iterator only has one value for some reason; mapGroup does not get to operate on multiple values. It's important that I see all the values for any given key at the same time. Any ideas?
Thanks!

Django: how to filter() after distinct()

If we chain a call to filter() after a call to distinct(), the filter is applied to the query before the distinct. How do I filter the results of a query after applying distinct?
Example.objects.order_by('a','foreignkey__b').distinct('a').filter(foreignkey__b='something')
The WHERE clause in the SQL generated by filter() means the filter is applied to the query before the distinct. I want to filter the queryset resulting from the distinct.
This is probably pretty easy, but I just can't quite figure it out and I can't find anything on it.
Edit 1:
I need to do this in the ORM...
SELECT z.column1, z.column2, z.column3
FROM (
SELECT DISTINCT ON (b.column1, b.column2) b.column1, b.column2, c.column3
FROM table1 a
INNER JOIN table2 b ON ( a.id = b.id )
INNER JOIN table3 c ON ( b.id = c.id)
ORDER BY b.column1 ASC, b.column2 ASC, c.column4 DESC
) z
WHERE z.column3 = 'Something';
(I am using Postgres by the way.)
So I guess what I am asking is "How do you nest subqueries in the ORM? Is it possible?" I will check the documentation.
Sorry if I was not specific earlier. It wasn't clear in my head.
This is an old question, but when using Postgres you can do the following to force nested queries on your 'Distinct' rows:
foo = Example.objects.order_by('a','foreign_key__timefield').distinct('a')
bar = Example.objects.filter(pk__in=foo).filter(some_field=condition)
bar is the nested query as requested in OP without resorting to raw/extra etc. Tested working in 1.10 but docs suggest it should work back to at least 1.7.
My use case was to filter up a reverse relationship. If Example has some ForeignKey to model Toast then you can do:
Toast.objects.filter(pk__in=bar.values_list('foreign_key', flat=True))
This gives you all instances of Toast where the most recent associated example meets your filter criteria.
Big health warning about performance though, using this if bar is likely to be a huge queryset you're probably going to have a bad time.
Thanks a ton for the help, guys. I tried both suggestions and could not get either to work, but they started me in the right direction.
I ended up using
from django.db.models import Max, F
Example.objects.annotate(latest=Max('foreignkey__timefield')).filter(foreignkey__timefield=F('latest'), foreignkey__a='Something')
This finds the latest foreignkey__timefield for each Example and keeps the row only if it is that latest one and a matches; rows that are not the latest, or whose a does not match, are filtered out.
This does not nest subqueries, but it gives me the output I am looking for, and it is fairly simple. If there is a simpler way, I would really like to know.
No, you can't do this in one simple SELECT.
As you said in the comments, in the Django ORM filter is mapped to the SQL WHERE clause, and distinct is mapped to DISTINCT. And in SQL, DISTINCT always happens after WHERE, since it operates on the result set; see the SQLite docs for an example.
But you could write a sub-query to nest SELECTs; how depends on the actual goal (I don't know exactly what yours is... could you elaborate?).
Also, for your query, distinct('a') only keeps the first occurrence of each Example with the same a; is that what you want?