What is the best and most efficient approach to write an inner join in Apache Beam? - google-cloud-platform

Suppose my query is: "select b.* from sourav_test.test1 a inner join sourav_test.test2 b on a.id=b.id". What is the best and most efficient approach to write this in Apache Beam?

In Apache Beam SDK 2.5 a good approach is the join library, which performs SQL-like joins. For inner joins, the signature is as follows:
innerJoin(PCollection<KV<K, V1>> leftCollection, PCollection<KV<K, V2>> rightCollection)
Relating to your case, the left and right collections are the collections to be inner joined. K is the type of the key shared by both collections, and V1 and V2 are the value types of each collection respectively.
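As a concrete sketch, assuming both tables are read with BigQueryIO into PCollection<TableRow>s (test1Rows and test2Rows below are placeholder names, and id is assumed to be a string column), the join could look like this:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.extensions.joinlibrary.Join;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

// Key each side by the join column "id".
PCollection<KV<String, TableRow>> left = test1Rows.apply("KeyTest1",
    MapElements.into(TypeDescriptors.kvs(
            TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
        .via(row -> KV.of((String) row.get("id"), row)));
PCollection<KV<String, TableRow>> right = test2Rows.apply("KeyTest2",
    MapElements.into(TypeDescriptors.kvs(
            TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
        .via(row -> KV.of((String) row.get("id"), row)));

// Inner join: each output element pairs the key with (left value, right value).
PCollection<KV<String, KV<TableRow, TableRow>>> joined =
    Join.innerJoin(left, right);

// "select b.*" keeps only the right-hand rows.
PCollection<TableRow> result = joined.apply("TakeRight",
    MapElements.into(TypeDescriptor.of(TableRow.class))
        .via(kv -> kv.getValue().getValue()));

Under the hood the join library groups both sides with CoGroupByKey, so heavily skewed keys can become hot spots; depending on your setup you may also need to set a coder for TableRow (e.g. TableRowJsonCoder) explicitly.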

Related

How to evaluate a SQL query using a RelNode object

I am trying to convert a SQL query to TinkerPop Gremlin. The sql2Gremlin library does this, but it treats a join as a relation, while I am relying on a no-join approach where you can reference relations with a dot as a delimiter between two entities.
I have parsed and validated the query, and I have a RelRoot object.
Apache Calcite returns a RelRoot object, which is the root of the algebraic expression.
Let's say I don't want to apply any query optimization. How do I use my RelNode visitor to transform the RelRoot into the TinkerPop Gremlin DSL?
Ideally I would first use the FROM clause and then apply the filters defined in the WHERE clause. How are SELECT, filters, and the FROM clause represented in the RelRoot tree?
What does Apache Calcite mean by a relational expression, or RelNode?
Rephrasing the same question without the TinkerPop Gremlin context:
How should I use a RelRoot visitor to visit the RelRoot and transform the query into another DSL?
I don't know why you insist on the RelRoot and not the RelNode tree, but Apache Calcite does its optimizations of relational algebra on the RelNode stack. There is a class called RelVisitor that you might find interesting, since it can do exactly what you need: visit all RelNodes. You can then extract the information you need from them and build your DSL with it.
EDIT: In RelVisitor, you have access to the parent node and the child nodes of the currently visited node. You can extract all the information usually available to the RelNode object (see the docs), and if you cast it to a specific relational algebra operation, for example Project, you can extract the fields inside the Project operation with node.getRowType().getFieldList().forEach(field -> names.add(field.getName())), where names is a previously defined Set<String>. You can find the full code here.
You should also take a look at the algebra docs to understand how SQL maps to relational algebra in Calcite before attempting this.
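For a concrete starting point, here is a minimal RelVisitor sketch that walks the tree and prints the pieces you would translate to Gremlin; the per-node handling is a placeholder, not a real translation:

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelVisitor;
import org.apache.calcite.rel.core.Filter;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rel.core.TableScan;

RelVisitor visitor = new RelVisitor() {
  @Override
  public void visit(RelNode node, int ordinal, RelNode parent) {
    if (node instanceof TableScan) {
      // FROM clause: the table being scanned
      System.out.println("scan: " + node.getTable().getQualifiedName());
    } else if (node instanceof Filter) {
      // WHERE clause: the condition as a RexNode
      System.out.println("filter: " + ((Filter) node).getCondition());
    } else if (node instanceof Project) {
      // SELECT list: the projected fields
      node.getRowType().getFieldList()
          .forEach(f -> System.out.println("project: " + f.getName()));
    }
    super.visit(node, ordinal, parent); // recurses into the children
  }
};
visitor.go(relRoot.rel); // relRoot is the RelRoot you already have

Roughly, the SELECT list becomes a Project node, the WHERE clause becomes a Filter node, and the FROM clause becomes a TableScan (or Join) underneath them, so working up from the TableScan mirrors your "FROM first, then filters" plan.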

Django: SQL way (i.e. using annotation, aggregate, subquery, when, case, window, etc.) vs Python way

The SQL way means using annotation, aggregate, subquery, when, case, window, etc., i.e. getting all the extra calculated columns from SQL.
Until now, if I wanted to get some additional information from the data of each row (calculate something, etc.), I would fetch the table in a queryset, loop over each object, store the desired results in a list of dicts, and pass it to the template. Of course, I was using prefetching so that I could avoid N+1 queries.
But since Django 1.11 we can do the same thing in a more expressive way using SQL (i.e. using annotation, aggregate, subquery, when, case, window, etc.) and no Python.
Disadvantage:
One disadvantage I found with the SQL way rather than the Python way is debugging.
I have to do complex calculations on the data of each queryset object, and in Python I can check them step by step.
If I do it the SQL way, I can only see the final result and cannot trace the intermediate steps.
Advantage:
I haven't tried it, but I have heard from The Dramatic Benefits of Django Subqueries and Annotations that it is quite fast.
Presently most of my SQL takes less than 100 ms, and most of the time goes to DOM loading. So will the SQL way help me in any way?
I appreciate that Django has added more functions that help to write SQL expressively.

What is the best way to do intense read-only queries in Django

We have a really big application in Django which uses a Postgres database, and we want to build an analytics module.
This module uses a base query, e.g.:
someFoo = SomeFoo.objects.all() # Around 100000 objects returned.
Then we slice and dice this data, i.e.:
someFoo.filter(Q(creator=owner) | Q(moderated=False))
These queries will be very intensive, and since this will be an analytics and reporting dashboard, the queries will hit the database very hard.
What is the best way to handle complex queries in such conditions, i.e. when you have a base query that will be sliced and diced very often in a short span of time and never used again?
A few possible solutions that we have thought of are:
A read-only database and a write-only database.
Writing raw SQL queries and using them, as the Django ORM can be quite inefficient for certain types of queries.
Caching heavily (we have not thought about or done any research on this yet).
Edit: Example query:
select sport."sportName", sport.id, pop.name, analytics_query.loc_id, "new count"
from "SomeFoo_sportpop" as sportpop
join "SomeFoo_pop" as pop on (sportpop.pop_id = pop.id)
join "SomeFoo_sport" as sport on (sportpop.sport_id = sport.id)
join (select ref.catcher_pop_id as loc_id,
             (select count(*) from "SomeFoo_pref"
              where catcher_pop_id = ref.catcher_pop_id
                and status = 'pending' and exists = True) as "new count"
      from "SomeFoo_pref" as ref
      where ref.exists = TRUE and ref.catcher_pop_id is not NULL
      group by ref.catcher_pop_id) as analytics_query
  on (sportpop.pop_id = analytics_query.loc_id)
order by sport."sportName", pop.name asc
This is an example of a raw SQL query we are planning to make, and it is going to have a lot of WHERE clauses and GROUP BYs. Basically, we are going to slice and dice the base query a lot.
Is there any other possible solution or method that you can point us to? Any help is highly appreciated.
I can think of PREPARED STATEMENT and a faster server, maybe on Linux...

Using QSqlQuery with indexes

I have my own data-store mechanism for storing data, but I want to implement a standard data manipulation and query interface for end users, so I thought Qt SQL would be suitable for my case.
But I still cannot understand how to involve my indexes in a SQL query.
Let's say, for example,
I have a table with columns A (int), B (int), C (int), D (int), and column A is indexed. Assume I execute a query like select * from Foo where A = 10;
How do I involve my index in searching for the results?
You have written your own storage system and want to manipulate it using an SQL-like syntax? I don't think Qt SQL is the right tool for that job. It offers connectivity to various SQL servers and is not meant for parsing SQL statements; Qt expects to "pass through" the queries and then parse the result set and transform it into a Qt-friendly representation.
So if you only want a Qt-friendly representation, I see no reason to go through the indirection of SQL.
But regarding your problem:
In SQL, indexes are usually not stated in queries but during the creation of the table schema. However, SQL Server has a way to "hint" indexes; is that what you are looking for?
SELECT column_list FROM table_name WITH (INDEX (index_name) [, ...]);

How to join tables in HBase

I have to join tables in HBase.
I integrated Hive and HBase, and that is working well. I can query using Hive.
But can somebody help me join tables in HBase without using Hive? I think we can achieve this using MapReduce; if so, can anybody share a working example that I can refer to?
Please share your opinions.
I have an approach in mind. That is,
if I need to JOIN tables A x B x C,
I may use TableMapReduceUtil to iterate over A, then get data from B and C inside the TableMapper, and then use the TableReducer to write back to another table Y.
Will this approach be a good one?
That is certainly an approach, but if you are doing two random reads per scanned row, your speed will plummet. If you are filtering rows out significantly, or have a small dataset in A, that may not be an issue.
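For reference, the "two random reads per scanned row" version of your approach would look roughly like this sketch using the modern HBase client API (the table names, the assumption that B and C share A's row key, and the copied column are all placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class JoinAMapper extends TableMapper<ImmutableBytesWritable, Put> {
  private Connection conn;
  private Table tableB, tableC;

  @Override
  protected void setup(Context ctx) throws IOException {
    conn = ConnectionFactory.createConnection(ctx.getConfiguration());
    tableB = conn.getTable(TableName.valueOf("B"));
    tableC = conn.getTable(TableName.valueOf("C"));
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result a, Context ctx)
      throws IOException, InterruptedException {
    // The two random reads for every row scanned from A: this is the cost.
    Result b = tableB.get(new Get(row.get()));
    Result c = tableC.get(new Get(row.get()));
    if (!b.isEmpty() && !c.isEmpty()) { // inner-join semantics
      Put out = new Put(row.get());
      // Copy the columns you need from a, b and c; this one is hypothetical.
      out.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("b_col"),
          b.getValue(Bytes.toBytes("cf"), Bytes.toBytes("b_col")));
      ctx.write(row, out); // a TableReducer then writes to table Y
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    tableB.close();
    tableC.close();
    conn.close();
  }
}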
Sort-merge Join
However, the best approach, which will be available in HBase 0.96, is the MultipleTableInput method. This means that it will scan table A and write its output with a unique key that will allow table B to match up.
E.g. table A emits (b_id, a_info) and table B emits (b_id, b_info), merging together in the reducer.
This is an example of a sort-merge join.
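Before 0.96 you can still build the same shape yourself as a classic reduce-side join: one mapper per table tags each record with its source, and a single reducer merges them on the join key. A minimal sketch (the column family, qualifier, and "A|"/"B|" tag format are made up for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper over table A; a mirror-image mapper over table B emits "B|..." values.
public class AJoinMapper extends TableMapper<Text, Text> {
  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context ctx)
      throws IOException, InterruptedException {
    byte[] bId = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("b_id"));
    // Emit the join key so matching rows from A and B meet in one reducer call.
    ctx.write(new Text(Bytes.toString(bId)),
        new Text("A|" + Bytes.toString(value.getRow())));
  }
}

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text bId, Iterable<Text> tagged, Context ctx)
      throws IOException, InterruptedException {
    List<String> aSide = new ArrayList<String>();
    List<String> bSide = new ArrayList<String>();
    for (Text t : tagged) {
      String v = t.toString();
      if (v.startsWith("A|")) { aSide.add(v.substring(2)); }
      else { bSide.add(v.substring(2)); }
    }
    // Inner join: emit every (a_info, b_info) pair sharing this b_id.
    for (String a : aSide) {
      for (String b : bSide) {
        ctx.write(bId, new Text(a + "," + b));
      }
    }
  }
}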
Nested-Loop Join
If you are joining on the row key, or the joining attribute is sorted in line with table B, you can have an instance of a scanner in each task which sequentially reads from table B until it finds what it's looking for.
E.g. table A row key = "companyId" and table B row key = "companyId_employeeId". Then for each company in table A you can get all the employees using the nested-loop algorithm.
Pseudocode:
for company in TableA:
    for employee in TableB:
        if employee.company_id == company.id:
            emit(company.id, employee)
This is an example of a nested-loop join.
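Because the row keys are sorted, the inner loop does not actually need a full scan of table B: a prefix scan per company is enough. A rough sketch with the modern HBase client API (the table names and the "_" key separator come from the example above; everything else is assumed):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public static void nestedLoopJoin() throws IOException {
  try (Connection conn = ConnectionFactory.createConnection();
       Table companies = conn.getTable(TableName.valueOf("TableA"));
       Table employees = conn.getTable(TableName.valueOf("TableB"));
       ResultScanner companyScan = companies.getScanner(new Scan())) {
    for (Result company : companyScan) {
      byte[] companyId = company.getRow();
      // All employees of this company share the "<companyId>_" key prefix.
      Scan prefix = new Scan()
          .setRowPrefixFilter(Bytes.add(companyId, Bytes.toBytes("_")));
      try (ResultScanner emps = employees.getScanner(prefix)) {
        for (Result employee : emps) {
          // emit(company.id, employee): e.g. write to an output table
        }
      }
    }
  }
}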
More detailed join algorithms are here:
http://en.wikipedia.org/wiki/Nested_loop_join
http://en.wikipedia.org/wiki/Hash_join
http://en.wikipedia.org/wiki/Sort-merge_join