RelNode of a query in which FROM clause itself has a query - apache-calcite

I want to achieve result from a table where I ORDER BY column id and I don't want id to be present in the result. I can achieve this using the following query.
SELECT COALESCE (col1, '**')
FROM (select col1, id FROM myDataSet.myTable WHERE col4 = 'some filter' ORDER BY id);
Now, I want to create a RelNode of the above query. As far as I know, in calcite, to perform table scan, there are only two methods scan(String tableName) and scan(Iterable<String> tableNames). Is there a way to scan(RelNode ) ? How to do this ?

The query
select col1, col2, col2 FROM myDataSet.myTable WHERE col4 = 'some filter' ORDER BY id
should also give you the desired result.
If you want to represent the query you have written more directly, you would start by constructing a RelNode for the query in the from clause, starting with a scan of myDataSet.myTable, adding the filter, and the order. Then you can project the specific set of columns you want.

Just simply create a RelNode of inner subquery and create another projection on top of it. Like so.
builder.scan('myTable')
.filter(builder.call(SqlStdOperator.EQUALS, builder.field(col4), builder.literal('some filter') )))
.project(builder.field('col1'), builder.field('id'))
.sort(builder.field('id'))
.project(builder.call(SqlStdOperator.COALESCE(builder.field('col1'), builder.literal('**'))))
.build()

Related

Query for listing Datasets and Number of tables in Bigquery

So I'd like make a query that shows all the datasets from a project, and the number of tables in each one. My problem is with the number of tables.
Here is what I'm stuck with :
SELECT
smt.catalog_name as `Project`,
smt.schema_name as `DataSet`,
( SELECT
COUNT(*)
FROM ***DataSet***.INFORMATION_SCHEMA.TABLES
) as `nbTable`,
smt.creation_time,
smt.location
FROM
INFORMATION_SCHEMA.SCHEMATA smt
ORDER BY DataSet
The view INFORMATION_SCHEMA.SCHEMATA lists all the datasets from the project the query is executed, and the view INFORMATION_SCHEMA.TABLES lists all the tables from a given dataset.
The thing is that the view INFORMATION_SCHEMA.TABLES needs to have the dataset specified like this give the tables informations : dataset.INFORMATION_SCHEMA.TABLES
So what I need is to replace the *** DataSet*** by the one I got from the query itself (smt.schema_name).
I am not sure if I can do it with a sub query, but I don't really know how to manage to do it.
I hope I'm clear enough, thanks in advance if you can help.
You can do this using some procedural language as follows:
CREATE TEMP TABLE table_counts (dataset_id STRING, table_count INT64);
FOR record IN
(
SELECT
catalog_name as project_id,
schema_name as dataset_id
FROM `elzagales.INFORMATION_SCHEMA.SCHEMATA`
)
DO
EXECUTE IMMEDIATE
CONCAT("INSERT table_counts (dataset_id, table_count) SELECT table_schema as dataset_id, count(table_name) from ", record.dataset_id,".INFORMATION_SCHEMA.TABLES GROUP BY dataset_id");
END FOR;
SELECT * FROM table_counts;
This will return something like:

Remove duplicates based on sort

I have a customers table with ID's and some datetime columns. But those ID's have duplicates and i just want to Analyse distinct ID values.
I tried using groupby but this makes the process very slow.
Due to data sensitivity can't share it.
Any suggestions would be helpful.
I'd suggest using ROW_NUMBER() This lets you rank the rows by chosen columns and you can then pick out the first result.
Given you've shared no data or table and column names here's an example based on the Adventureworks database. The technique will be the same, you partition by whatever makes the group of rows you want to deduplicate unique (ProductKey below) and order in a way that makes the version you want to keep first (Children, birthdate and customerkey in my example).
USE AdventureWorksDW2017;
WITH CustomersOrdered AS
(
SELECT S.ProductKey, C.CustomerKey, C.TotalChildren, C.BirthDate
, ROW_NUMBER() OVER (
PARTITION BY S.ProductKey
ORDER BY C.TotalChildren DESC, C.BirthDate DESC, C.CustomerKey ASC
) AS CustomerSequence
FROM dbo.FactInternetSales AS S
INNER JOIN dbo.DimCustomer AS C
ON S.CustomerKey = C.CustomerKey
)
SELECT ProductKey, CustomerKey
FROM CustomersOrdered
WHERE CustomerSequence = 1
ORDER BY ProductKey, CustomerKey;
you can also just sort the columns with date column an than click on id column and delete duplicates...

How to substitute NULL with value in Power BI when joining one to many

In my model I have table AssignedToUser that don't contain any NULL values.
Table TotalCounts contains number of tasks for each User.
Those two table joined on UserGUID, and table TotalCounts contains NULL value for UserGUID.
When I drop everything in one table there is NULL value for AssignedToUser.
How can I substitute value NULL for AssignedToUser for "POOL".
Under EditQuery I tried to Create additional column
if [AssignedToUser] = null then "POOL" else [AssignedToUser]
But that didnt help.
UPDATE:
Thanks Alexis.
I have created FullAssignedToUsers table, but when I try to make a relationship with TotalCounts on UserGUID - it doesnt like it.
Data in new a table looks like this:
UPDATE:
File .ipbx can be accessed here:
https://www.dropbox.com/s/95frggpaq6tce7q/User%20Open%20Closed%20Tasks%20Experiment.pbix?dl=0
I believe the problem here is that your join has UserGUID values that are not in your AssignedToUsers table.
To correct this, one possibility is to replace your AssignedToUsers table with one that contains all the UserGUID values from the TotalCounts table.
FullAssignedToUsers =
ADDCOLUMNS(VALUES(TotalCounts[UserGUID]),
"AssignedToUser",
LOOKUPVALUE(AssignedToUsers[AssignedToUser],
AssignedToUsers[UserGUID], TotalCounts[UserGUID]))
The should get you the full outer join. You can then create the custom column like you described in the table and use that column in your visual.
You'll probably want to break the relationships with the original AssignedToUsers table and create relationships with the new one instead.
If you don't want to take that extra step, you can do an ISBLANK inside your new table formula.
FullAssignedToUsers =
ADDCOLUMNS(VALUES(TotalCounts[UserGUID]),
"AssignedToUser",
IF(
ISBLANK(
LOOKUPVALUE(AssignedToUsers[AssignedToUser],
AssignedToUsers[UserGUID], TotalCounts[UserGUID])),
"POOL",
LOOKUPVALUE(AssignedToUsers[AssignedToUser],
AssignedToUsers[UserGUID], TotalCounts[UserGUID])))
Note: This is equivalent to doing a right outer join merge on the AssignedToUsers table in the query editor and then replacing the nulls with "POOL". I'd actually recommend approaching it that way instead.
Another way to approach it is to pull the AssignedToUser column over to the TotalCounts table in a custom column and use that column in your visual instead.
AssignedToUsers =
IF(ISBLANK(RELATED(AssignedToUsers[AssignedToUser])),
"POOL",
RELATED(AssignedToUsers[AssignedToUser]))
This is equivalent to doing a left outer join merge on the TotalCounts table in the query editor, expanding the AssignedToUser column, then replacing nulls with "POOL" in that column.
In Dax missing values are Blank() not null. Try this:
=IF(
ISBLANK(AssignedToUsers[AssignedToUser]),
"Pool",
AssignedToUsers[AssignedToUser]
)

Using another table in BigQuery Regex

I would like to map a string column to a category based on a regular expression match.
Is it possible to use another bigquery table containing the regular expressions and corresponding category for this? This would make it easier for me to update only a table when adding new categories/updating the regex, instead of having to update all queries that would use this lookup.
Query:
CASE
-- Use the entries from another table here
WHEN REGEXP_MATCH(string_to_check, cat1regex) THEN cat1
WHEN REGEXP_MATCH(string_to_check, cat2regex) THEN cat2
etc.
END
Mapping table:
Regex category
pagex|pagey xy
pagez|page1 z1
It's also possible there is another simple way to do something similar that I'm not thinking of, answers pointing those out are welcome too.
Any help would be appreciated.
Below is for BigQuery Standard SQL
#standardSQL
SELECT
string_to_check,
MAX(IF(REGEXP_CONTAINS(string_to_check, reg), category, NULL)) AS category
FROM yourTable
CROSS JOIN mappingTable
GROUP BY string_to_check
You can test / play with it using below dummy date from your question
#standardSQL
WITH `mappingTable` AS (
SELECT r'pagex|pagey' AS reg, 'xy' AS category UNION ALL
SELECT r'pagez|page1', 'z1'
),
`yourTable` AS (
SELECT string_to_check
FROM UNNEST(["pagex.com", "pagez#example.org", "page.example.net"]) AS string_to_check
)
SELECT
string_to_check,
MAX(IF(REGEXP_CONTAINS(string_to_check, reg), category, NULL)) AS category
FROM yourTable
CROSS JOIN mappingTable
GROUP BY string_to_check

How to phrase sql query when selecting second table based on information on first table

I have two tables I would like to call, but I am not sure if it is possible to combine them into one query or I have to some how call 2 different queries.
Basically I have 2 tables:
1) item_table: name/id etc. + category ID
2) category_table: categoryID, categoryName, categoryParentID.
The parent categories are also inside the same table with their own name.
I would like to call on my details from item_table, as well as getting the name of the category, as well as the NAME of the parent category.
I know how to get the item_table data, plus the categoryName through an INNER JOIN. But can I use the same query to get the categoryParent's name?
If not, what would be the mist efficient way to do it? The rest of the code is in C++.
SELECT item_table.item_name, c1.name AS CatName, c2.name AS ParentCatName
FROM item_table join category_table c1 on item_table.categoryID=c1.categoryID
LEFT OUTER JOIN category_table c2 ON c2.categoryID = c1.categoryParentID
SQL Fiddle: here