Calcite doing full table scan

I'm using Calcite to query both MySQL and Vertica.
When running this query:
statement.executeQuery(
    "SELECT a.name, b.name " +
    "FROM mysqlschema.tableA AS a " +
    "INNER JOIN verticaschema.tableB AS b ON a.id = b.id " +
    "WHERE a.id = 1 AND b.id = 1");
I can see that Calcite is accessing tableA with the correct predicate, but for some reason it runs SELECT * FROM verticaschema.tableB against the second table.
Is there a way to optimize this so that Calcite pushes the predicate b.id = 1 down to tableB as well?
Thanks

Apache Calcite has some limitations:
Current limitations: The JDBC adapter currently only pushes down table scan operations; all other processing (filtering, joins, aggregations and so forth) occurs within Calcite. Our goal is to push down as much processing as possible to the source system, translating syntax, data types and built-in functions as we go. If a Calcite query is based on tables from a single JDBC database, in principle the whole query should go to that database. If tables are from multiple JDBC sources, or a mixture of JDBC and non-JDBC, Calcite will use the most efficient distributed query approach that it can.
Until the JDBC adapter can push filters down, you have to work around this yourself.
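A pragmatic first step is to rewrite the query so each source's predicate sits inside its own subquery, then check with EXPLAIN PLAN FOR whether the filter now reaches Vertica. Whether it does depends on the planner rules enabled in your setup, so treat this as a sketch rather than a guaranteed fix:
SELECT a.name, b.name
FROM (SELECT id, name FROM mysqlschema.tableA WHERE id = 1) AS a
INNER JOIN (SELECT id, name FROM verticaschema.tableB WHERE id = 1) AS b
  ON a.id = b.id
If that still produces a full scan, the durable fix is to implement filter push-down in your adapter, which is what "implement it yourself" amounts to.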

Related

Athena equivalent to information_schema

For background: I come from SQL Server, where I make heavy use of the system tables and information_schema to tell me all about my tables and columns.
I didn't expect exactly the same power in Athena, but I'm currently quite shocked and frustrated by how little seems to be available - unless I've missed something?
For example, 'describe mytable' just describes one table at a time.
How about showing the columns for ALL tables in one result?
It also does not output the table name, nor allow you to manually add that in as a custom column.
All the results of these show/list/describe commands seem to produce a text list, not a recordset, so you cannot take the results and join them to other tables or views to build more complex outputs.
Is there any other way to query the contents of my databases?
Thanks in advance
Athena is based on Presto. Presto provides an information_schema schema, and I checked that it is accessible in Athena.
You can run e.g. a query like:
SELECT * FROM information_schema.columns;
to get a list of columns of all tables.
You can filter this by "database":
SELECT * FROM information_schema.columns WHERE table_schema = '<databasename>';
Note however that these types of queries are not necessarily very performant.
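Because these are ordinary queries that return a recordset, you can also join the metadata tables to each other, which covers the "join them to other tables or views" requirement. For example, a sketch that lists every table in a database together with its column count ('mydatabase' is a placeholder):
SELECT t.table_name,
       count(c.column_name) AS column_count
FROM information_schema.tables t
JOIN information_schema.columns c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
WHERE t.table_schema = 'mydatabase'
GROUP BY t.table_name
ORDER BY t.table_name;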

Using merge in Power Query while keeping native query

I'm trying to reduce my dataset of 1,000,000 records to only the subset I need (roughly 500) by creating an inner join to a different table. Unfortunately, it seems that Power Query drops the "native query" and loads the entire dataset before reducing it by merging it with the related table. I have no access to the database, unfortunately; otherwise I would have written the SQL myself. Is there a way to make Merge work with a native SQL query?
Thanks
I would first check that your "related table" query can run as a native query - right-click on its last step and check whether View Native Query is enabled.
If that's the case, then it may be due to the Join Kind in the Merge Queries step. I've noticed that against SQL Server data sources, Join Kinds other than the default Left Outer Join tend to kill the Native Query option.
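For intuition about what you are trying to preserve: when folding works, the Merge step is translated into a server-side join, so only the matching rows ever leave the database. The folded query would resemble something like this illustrative SQL (hypothetical table and key names):
SELECT big.*
FROM big_table AS big
INNER JOIN small_table AS small
  ON big.key_column = small.key_column
If View Native Query goes grey after the Merge, the join is instead being done locally over the full 1,000,000 rows.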

Google Spanner - How do you copy data to another table?

Since Spanner does not support a statement like
insert into dest select * from source_table
how do we select a subset of a table and copy those rows into another table?
I am trying to write data to a temporary table and then move the data to an archive table at the end of the day. The only solution I have found so far is to select rows from the source table and write them to the new table. This is done using the Java API, which has no ResultSet-to-Mutation converter, so I need to map every column of the table to the new table, even when they are exactly the same.
Another problem is updating just one column's data: there is no way of doing update table_name set column = column - 1.
Again, to do that I need to read the row and map every field into the update Mutation. This does not scale if I have many tables, since I would need to code it for all of them; a ResultSet-to-Mutation converter would be nice here too.
Is there any generic Mutation cloner and/or any other way to copy data between tables?
As of version 0.15, this open-source JDBC driver supports bulk INSERT statements that can be used to copy data from one table to another. The INSERT syntax can also be used to perform bulk UPDATEs on data.
Bulk insert example:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT C1, C2, C3
FROM OTHER_TABLE
WHERE C1>1000
Bulk update is done using an INSERT-statement with the addition of ON DUPLICATE KEY UPDATE. You have to include the value of the primary key in your insert statement in order to 'force' a key violation which in turn will ensure that the existing rows will be updated:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT COL1, COL2+1, COL3+COL2
FROM TABLE
WHERE COL2<1000
ON DUPLICATE KEY UPDATE
You can use the JDBC driver with, for example, SQuirreL to test it, or to do ad hoc data manipulation.
Please note that the underlying limitations of Cloud Spanner still apply, meaning a maximum of 20,000 mutations in one transaction. The JDBC Driver can work around this limit by specifying the value AllowExtendedMode=true in your connection string or in the connection properties. When this mode is allowed, and you issue a bulk INSERT- or UPDATE-statement that will exceed the limits of one transaction, the driver will automatically open an extra connection and perform the bulk operation in batches on the new connection. This means that the bulk operation will NOT be performed atomically, and will be committed automatically after each successful batch, but at least it will be done automatically for you.
Have a look here for some more examples: http://www.googlecloudspanner.com/2018/02/data-manipulation-language-with-google.html
Another approach to performing the bulk copy is to batch it using LIMIT and OFFSET:
insert into dest (c1, c2, c3)
(select c1, c2, c3 from source_table order by c1 limit 1000);
insert into dest (c1, c2, c3)
(select c1, c2, c3 from source_table order by c1 limit 1000 offset 1000);
insert into dest (c1, c2, c3)
(select c1, c2, c3 from source_table order by c1 limit 1000 offset 2000);
...and so on until all rows are copied. Note that each OFFSET must step by exactly the LIMIT, otherwise rows are silently skipped, and a stable ORDER BY is needed so that successive batches see the rows in a consistent order.
PS: This is more of a trick, but it will definitely save you time.
Spanner supports expressions in the SET clause of an UPDATE statement, which can be used to supply a subquery fetching data from another table, like this:
UPDATE target_table
SET target_field = (
    -- use a subquery as an expression (it must return a single row)
    SELECT source_table.source_field
    FROM source_table
    WHERE my_condition IS TRUE
)
WHERE my_other_condition IS TRUE;
The generic syntax is:
UPDATE table SET column_name = { expression | DEFAULT } WHERE condition
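Applied to the question's decrement example, this means a plain DML statement now works. One thing to note is that Spanner requires an explicit WHERE clause on every UPDATE; WHERE TRUE touches all rows. A sketch with a hypothetical table and column:
UPDATE inventory
SET stock_count = stock_count - 1
WHERE TRUE;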

How do I convert an MSSQL query to a Postgres query

I have to migrate complex SQL Server queries to Postgres.
By "complex SQL query" I mean: more than four tables joined, lots of filters, aggregate functions, CASE WHEN ... THEN ... expressions, and so on.
For example, here is a sample input:
SELECT
    ROW_NUMBER() OVER (ORDER BY GETDATE()) AS ID,
    GETDATE() AS time,
    arc.Customer AS Customer,
    GroupBy4KPI,
    arc.MasterAccount,
    arc.CustomerClass
FROM tablename_1 arc
LEFT JOIN tablename_2 varc ON arc.Customer = varc.Customer
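For reference, a hand translation of that sample is mostly mechanical; the only T-SQL-specific piece is GETDATE(), which becomes now() in Postgres. A sketch, keeping the sample's table and column names:
SELECT
    ROW_NUMBER() OVER (ORDER BY now()) AS id,
    now() AS "time",
    arc.Customer AS Customer,
    GroupBy4KPI,
    arc.MasterAccount,
    arc.CustomerClass
FROM tablename_1 arc
LEFT JOIN tablename_2 varc ON arc.Customer = varc.Customer;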
Please suggest a custom function or tool that could accomplish the conversion of DML statements from MSSQL to Postgres.
Fact: I have about 1,600 such queries to parse, so this job is a repetitive one; I have to convert MSSQL queries on a daily basis.
Any help would be much appreciated.

SAS, SQL explicit passthrough, multiple Teradata databases

I have inherited a complex Teradata SQL query which runs across 3 Teradata databases.
Preferring not to get bogged down in the functional aspects of the query (with various windowing statements), I would like to pass the query explicitly through to Teradata (same server).
The construct that I am familiar with connects to only one database, e.g.:
proc sql;
connect to teradata (user="userid" password="password1" mode=teradata
database=DB1 tdpid="MyServer");
create table TD_Results as
select * from connection to TERADATA
(
... TD SQL CODE
... TD SQL CODE
);
quit;
Does anyone have an idea as to how the original TD SQL query referencing 3 databases could be used via passthrough?
Thanks.
Q.
What Teradata calls a DATABASE is what Oracle calls a SCHEMA. You just use a two-level name to reference the tables:
select a.x,b.y,c.z
from db1.table1 a
, db2.table2 b
, db3.table3 c
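Putting that together with the pass-through construct from the question, the whole three-database query can be sent to Teradata in a single CONNECT block. A sketch reusing the question's connection settings (the join keys here are hypothetical; substitute the real ones):
proc sql;
connect to teradata (user="userid" password="password1" mode=teradata
database=DB1 tdpid="MyServer");
create table TD_Results as
select * from connection to teradata
(
select a.x, b.y, c.z
from db1.table1 a
join db2.table2 b on a.id = b.id /* hypothetical join keys */
join db3.table3 c on b.id = c.id
);
quit;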
If you mean that you need to select from multiple servers, then I think you need to look into using QueryGrid syntax. In that syntax you can add the server name with a trailing # on the table reference:
select a.x,b.y,c.z
from db1.table1 a
, db2.table2#server2 b
, db3.table3 c