When I use a join query with Calcite, how can I use the small table's data as a filter condition on the large table? - apache-calcite

When I use a join query with Calcite, how can I optimize the query plan so that it queries the small table first and then uses the small table's result as the filter condition on the large table?
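The pattern described in the question can be sketched independently of Calcite as two explicit steps: read the join keys from the small table, then push them into the large table's query as an IN filter. The sketch below uses Python's built-in sqlite3 module purely for illustration; the table names (small_t, large_t), columns and data are hypothetical, and it says nothing about which plan Calcite's optimizer would actually choose.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE small_t (k INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE large_t (id INTEGER PRIMARY KEY, k INTEGER, v TEXT);
    INSERT INTO small_t VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO large_t VALUES (100, 1, 'a'), (101, 2, 'b'), (102, 3, 'c'), (103, 2, 'd');
""")

# Step 1: query the small table first and collect the join keys.
keys = [row[0] for row in conn.execute(
    "SELECT k FROM small_t WHERE region = ? ORDER BY k", ("EU",))]

# Step 2: use those keys as a filter condition on the large table.
placeholders = ", ".join("?" for _ in keys)
rows = conn.execute(
    f"SELECT id, k, v FROM large_t WHERE k IN ({placeholders}) ORDER BY id",
    keys,
).fetchall()

print(keys)
print(rows)
```

This only pays off when the key list is small enough to send as a filter; for a large key set a regular join is usually the better choice.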

Related

Clean up a dimension table

My dimension tables contain more rows than my fact table. I would like the dimension table fields to show only the values that appear in the fact table when they are used as a filter in the filter panel. Which cleaning/modeling steps are best to achieve this?
I also know how to write SQL, if that is an option for an answer.
Rather than using a table for your dimension, use a view that has an inner join to the fact table.
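A minimal sketch of that suggestion, run against an in-memory SQLite database from Python so it can be tried directly. The table, column and view names (dim_product, fact_sales, dim_product_used) are assumptions made up for the example, not names from the question.

```python
import sqlite3

# Hypothetical dimension and fact tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Apple'), (2, 'Banana'), (3, 'Cherry');
    INSERT INTO fact_sales VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 3, 2.0);

    -- The inner join drops dimension rows with no facts;
    -- DISTINCT keeps one row per dimension member that has many facts.
    CREATE VIEW dim_product_used AS
    SELECT DISTINCT d.product_id, d.name
    FROM dim_product AS d
    INNER JOIN fact_sales AS f ON f.product_id = d.product_id;
""")

used = conn.execute(
    "SELECT product_id, name FROM dim_product_used ORDER BY product_id"
).fetchall()
print(used)  # 'Banana' has no sales, so it is not in the view
```

Pointing the filter panel at the view instead of the base dimension table then exposes only the members that occur in the fact table.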

AWS Athena query on parquet file - using columns in where clause

We are planning to use Athena as a backend service for our data (stored as Parquet files in partitions) in S3.
One of the things we want to find out is how adding additional columns to the WHERE clause of a query affects its run time.
For example, we have 10 million records in one Hive partition (partitioned on the column 'date').
All of the queries below return the same volume, 10 million rows. Would they all take the same time, or does adding columns to the WHERE clause reduce the run time (as Parquet is a columnar format)?
I tried to test this, but the results were not consistent, I guess because there was some queuing time as well.
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'
Parquet files contain page "indexes" (min, max and bloom filters). If you sort the data by the columns in question during insert, for example like this:
insert overwrite table mytable partition (dt)
select col1, --some columns
       type,
       subtype,
       dt
from source_table --hypothetical source table name
distribute by dt
sort by type, subtype
then these indexes may work efficiently, because rows with the same type and subtype are written to the same pages, and data pages are then selected using the indexes. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Also switch on predicate pushdown: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html
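To see why sorting matters for min/max page statistics, here is a toy simulation in plain Python. It does not read or write Parquet; it only models "pages" as fixed-size slices of a column, with a page size and row count chosen arbitrarily for the example, and counts how many pages a reader would still have to open for the predicate type = 'XXX'.

```python
def page_stats(values, page_size):
    """Split values into fixed-size pages and record each page's (min, max)."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(p), max(p)) for p in pages]


def pages_to_read(stats, target):
    """Count pages whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


# 1000 rows with four type values interleaved (toy data).
types = ["AAA", "BBB", "XXX", "ZZZ"]
unsorted_rows = [types[i % 4] for i in range(1000)]
sorted_rows = sorted(unsorted_rows)

unsorted_hits = pages_to_read(page_stats(unsorted_rows, 50), "XXX")
sorted_hits = pages_to_read(page_stats(sorted_rows, 50), "XXX")

print(unsorted_hits, sorted_hits)  # prints "20 5"
```

With interleaved values every page spans 'AAA' to 'ZZZ', so no page can be skipped; after sorting, only the pages that actually hold 'XXX' overlap the predicate.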

Using merge in Power Query while keeping native query

I'm trying to reduce my dataset of 1.000.000 records to only the subset I need (+/- 500) by creating an Inner Join to a different table. Unfortunately, it seems that Power Query drops the "native query" and loads the entire dataset before reducing it by merging it with the related table. I have no access to the database; otherwise I would have written the SQL myself. Is there a way to make merge work with a native SQL query?
Thanks
I would first check that your "related table" query can run as a native query: right-click on its last step and check whether View Native Query is enabled.
If that's the case, then it may be due to the Join Kind in the Merge Queries step. I've noticed that against SQL Server data sources, Join Kinds other than the default Left Outer Join tend to kill the Native Query option.

Athena: Minimize data scanned by query including JOIN operation

Let there be an external table in Athena which points to a large amount of data stored in Parquet format on S3. It contains a lot of columns and is partitioned on a field called 'timeid'. Now, there's another external table (a small one) which maps timeid to date.
When the smaller table is also partitioned on timeid, and we join the two on their partition id (timeid) and put date into the WHERE clause, only those records of the large table whose timeids correspond to that date are scanned. The entire data set is not scanned here.
However, if the smaller table is not partitioned on timeid, a full data scan takes place even when there is a condition on the date column.
Is there a way to avoid the full data scan even when the large partitioned table is joined with an unpartitioned small table? This is required because the small table contains only one record per timeid, and it would not be reasonable to create a separate file for each.
That's an interesting discovery!
You might be able to avoid the large scan by using a sub-query instead of a join.
Instead of:
SELECT ...
FROM large_table
JOIN small_table ON small_table.timeid = large_table.timeid
WHERE small_table.date > '2017-08-03'
you might be able to use:
SELECT ...
FROM large_table
WHERE large_table.timeid IN
  (SELECT timeid FROM small_table
   WHERE date > '2017-08-03')
I haven't tested it, but that would avoid the JOIN you mention.
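As a sanity check that the two forms return the same rows when the small table has one record per timeid, here is a small sketch using Python's built-in sqlite3 module. The table layouts and data are assumptions reconstructed from the question (the tables are joined on timeid and the date lives in the small table). It only demonstrates that the results match; it does not show how much data Athena would scan for either form.

```python
import sqlite3

# Hypothetical miniature versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE small_table (timeid INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE large_table (timeid INTEGER, payload TEXT);
    INSERT INTO small_table VALUES
        (1, '2017-08-02'), (2, '2017-08-03'), (3, '2017-08-04'), (4, '2017-08-05');
    INSERT INTO large_table VALUES
        (1, 'a'), (2, 'b'), (3, 'c'), (3, 'd'), (4, 'e');
""")

# Join form: filter on the small table's date after joining on timeid.
join_rows = conn.execute("""
    SELECT l.timeid, l.payload
    FROM large_table AS l
    JOIN small_table AS s ON s.timeid = l.timeid
    WHERE s.date > '2017-08-03'
    ORDER BY l.timeid, l.payload
""").fetchall()

# Sub-query form: resolve the matching timeids first, then filter the large table.
subquery_rows = conn.execute("""
    SELECT timeid, payload
    FROM large_table
    WHERE timeid IN (SELECT timeid FROM small_table WHERE date > '2017-08-03')
    ORDER BY timeid, payload
""").fetchall()

print(join_rows == subquery_rows)
print(subquery_rows)
```

If the small table could hold several rows per timeid, the join form would duplicate large-table rows while the IN form would not, so the equivalence relies on the one-record-per-timeid property stated in the question.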

In Power Query language (M language), how can we add custom "Value" and "Table" columns to a table manually?

In Power Query, if we get data from an SQL database, "Value" and "Table" columns are created automatically if there are relationships in the database.
AFAIK "Table" and "Value" mean one-to-many and many-to-one relationships respectively.
My problem is that there are no relationships in our database, so Power Query cannot generate these columns automatically. How can I add these columns manually if I know the relationships between the tables in question?
I found the Table.NestedJoin function, which returns a Table object (but with low performance, even though there are relationships in the database).
But I could not find any function which returns a Value object (a record from another table).
Other possible solutions, each with a flaw, are:
You may advise that I load the tables as they are in the database and create relationships in the Relationships section of Power BI (or in the Power Pivot section of Excel). But I need this Value object in Power Query, because I would like to filter the rows according to the related table before loading all the rows of the table.
Creating a native query which joins the tables, which is not my preference.
Creating a Table object instead of a Value object (we are sure that only one record will come back). I still have a performance problem with the Table.NestedJoin method. Is there another option?
Thanks in advance...
Just today I had much the same performance issue, but I finally solved it. In my solution I work with views, but need to filter the incoming records.
When I use code like this:
let
    filter1 = 2016,
    filter2 = "SomeText",
    // Navigation to a table or view takes a record as the lookup key: {[Schema=..., Item=...]}
    tbl = Sql.Database("MyServer","MyDB"){[Schema="dbo",Item="MyTableOrView"]}[Data],
    filteredTable = Table.SelectRows(tbl, each ([field1] = filter1) and ([field2] = filter2))
in
    filteredTable
it works slowly. But if I use NestedJoin, it performs much better.
let
    // One-row filter table; the column names are passed as a single list
    Source = Table.FromColumns({{2016}, {"SomeText"}}, {"filter1", "filter2"}),
    tbl = Sql.Database("MyServer","MyDB"){[Schema="dbo",Item="MyTableOrView"]}[Data],
    filteredTable = Table.NestedJoin(tbl, {"field1", "field2"}, Source, {"filter1", "filter2"}, "NewColumn", JoinKind.Inner)
in
    filteredTable
However, I noticed that even the fastest design I got works slower than a plain query that returns all ~1300 rows from the view.
I have no SQL Profiler to track down exactly what is sent to the server, but it seems to me that query folding works when you use inner joins.
Try the following: make two queries to the two tables (with no other steps!) and inner join them, then see whether it works faster.