Utilizing Apache Calcite without connecting to DB - apache-calcite

I am trying to generate a SQL query from relational algebra in Calcite without connecting to the database.
I saw an example on the Calcite website where the JDBC adapter uses the DB connection the first time and all subsequent calls get data from the cache. What I am looking for is not connecting to the DB at all, just doing a translation from relational algebra to SQL.
Any hint is much appreciated.

An example is available in this notebook; I've included the relevant code below. You will need to create a schema instance for this to work. In this case, I've just used one of Calcite's test schemas.
import org.apache.calcite.jdbc.CalciteSchema;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelWriter;
import org.apache.calcite.rel.externalize.RelWriterImpl;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.test.CalciteAssert;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.RelBuilder;
SchemaPlus rootSchema = CalciteSchema.createRootSchema(true).plus();
FrameworkConfig config = Frameworks.newConfigBuilder()
    .defaultSchema(
        CalciteAssert.addSchema(rootSchema, CalciteAssert.SchemaSpec.HR))
    .build();
RelBuilder builder = RelBuilder.create(config);
RelNode opTree = builder.scan("emps")
    .scan("depts")
    .join(JoinRelType.INNER, "deptno")
    .filter(builder.equals(builder.field("empid"), builder.literal(100)))
    .build();
The remaining step of converting to SQL is fairly straightforward: construct an instance of RelToSqlConverter and call its visit method on the RelNode object.
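For illustration, a minimal sketch of that step might look like the following, assuming a reasonably recent Calcite version (where the entry point is visitRoot; older releases exposed visitChild instead) and the ANSI dialect:
import org.apache.calcite.rel.rel2sql.RelToSqlConverter;
import org.apache.calcite.sql.SqlDialect;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.AnsiSqlDialect;
// Pick the dialect the SQL should be rendered in; any SqlDialect works here.
SqlDialect dialect = AnsiSqlDialect.DEFAULT;
RelToSqlConverter converter = new RelToSqlConverter(dialect);
// Convert the RelNode tree built above into a SqlNode, then render it as a SQL string.
SqlNode sqlNode = converter.visitRoot(opTree).asStatement();
System.out.println(sqlNode.toSqlString(dialect).getSql());
No JDBC connection is opened at any point; the translation happens entirely in memory.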

You can also use the new Quidem tests to run queries like this, which would be much easier.

Related

Send data from Cloud SQL MSSQL to BigQuery using Dataflow

I want to create a Flex Template in Dataflow to send CDC data from MSSQL to BigQuery. I was able to connect using pyodbc, but I see that there is a library (from apache_beam.io.jdbc import ReadFromJdbc) that gives me an error when I use it.
Using pyodbc, can I use a ParDo to create the transformations? What would the structure of the pipeline look like?
Thanks.
import pyodbc

connection = pyodbc.connect(
    'DRIVER=' + driver + ';SERVER=' + self.source.host +
    ';PORT=' + str(int(self.source.port)) +
    ';DATABASE=' + self.source.database +
    ';UID=' + self.source.username +
    ';PWD=' + self.source.password)
I have reviewed the following docs:
https://www.case-k.jp/entry/2022/01/06/094509
https://docs.devart.com/odbc/bigquery/python.htm
I'm doing everything over private IP.

How to connect Janusgraph deployed in GCP with Python runtime?

I have deployed JanusGraph using Helm in Google Cloud containers, following the documentation below:
https://cloud.google.com/architecture/running-janusgraph-with-bigtable
I'm able to fire Gremlin queries using Google Cloud Shell.
(Snapshot of Google Cloud Shell)
Now I want to access JanusGraph using Python. I tried the lines of code below, but it's unable to connect to JanusGraph inside the GCP container.
from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
graph = Graph()
g = graph.traversal().withRemote(DriverRemoteConnection('gs://127.0.0.1:8182/gremlin','g'))
value = g.V().has('name','hercules').values('age')
print(value)
Here's the output I'm getting:
[['V'], ['has', 'name', 'hercules'], ['values', 'age']]
Whereas the output should be:
30
Has anyone tried to access JanusGraph using Python inside GCP?
You need to end the query with a terminal step such as next() or toList(). What you are seeing is the query bytecode, printed because the query was never submitted to the server due to the missing terminal step. So you need something like this:
value = g.V().has('name','hercules').values('age').next()
print(value)

How to Load Data into remote Neo4j AWS instance?

I want to import data into a Neo4j instance brought up in AWS (community edition from the AWS Marketplace). One option is to convert the data to CSV, run the LOAD CSV command in the Neo4j UI, and point it to a public HTTP address that reads from S3. This, however, means we need to expose the file publicly, which would expose sensitive data. How else can I import this data?
Thanks!
I would suggest you use any of the Neo4j drivers, such as Python or Java. Here is one Python example that I used in my posts:
def store_to_neo4j(distances):
    data = [{'source': el[0], 'target': el[1], 'weight': distances[el]}
            for el in distances]
    with driver.session() as session:
        session.run("""
            UNWIND $data as row
            MERGE (c:Character{name:row.source})
            MERGE (t:Character{name:row.target})
            MERGE (c)-[i:INTERACTS]-(t)
            SET i.weight = coalesce(i.weight,0) + row.weight
            """, {'data': data})
You don't want to execute the import row by row; instead, batch, say, 1000 rows into a single parameter and then use the UNWIND operator to import the data.
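Since the Java driver is also mentioned as an option, a rough sketch of the same batched UNWIND pattern with the official Neo4j Java driver could look like the code below; the bolt URI, credentials, and the loadRows() helper are hypothetical placeholders for your own connection details and data source:
import java.util.List;
import java.util.Map;
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class BatchedImport {
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) {
        // Hypothetical connection details; point this at your instance's private address.
        try (Driver driver = GraphDatabase.driver("bolt://10.0.0.5:7687",
                 AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {
            List<Map<String, Object>> rows = loadRows();  // hypothetical data source
            for (int i = 0; i < rows.size(); i += BATCH_SIZE) {
                List<Map<String, Object>> batch =
                        rows.subList(i, Math.min(i + BATCH_SIZE, rows.size()));
                // One round trip per batch: UNWIND expands the list parameter server-side.
                session.run("UNWIND $data AS row "
                        + "MERGE (c:Character {name: row.source}) "
                        + "MERGE (t:Character {name: row.target}) "
                        + "MERGE (c)-[r:INTERACTS]-(t) "
                        + "SET r.weight = coalesce(r.weight, 0) + row.weight",
                        Map.of("data", batch));
            }
        }
    }

    private static List<Map<String, Object>> loadRows() {
        // Placeholder: read your data from wherever it lives (e.g. a local export)
        // and return maps with 'source', 'target' and 'weight' keys.
        return List.of();
    }
}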

Run Redshift Queries Periodically

I have started researching Redshift. It is defined as a "Database" service in AWS. From what I have learnt so far, we can create tables and ingest data from S3 or from external sources like Hive into a Redshift database (cluster). Also, we can use a JDBC connection to query these tables.
My questions are:
Is there a place within the Redshift cluster where we can store our queries and run them periodically (e.g. daily)?
Can we store our query in an S3 location and use that to create output to another S3 location?
Can we load a DB2 table unload file with a mixture of binary and string fields into Redshift directly, or do we need an intermediate process to convert the data into something like CSV?
I have done some Googling about this. If you have links to resources, that would be very helpful. Thank you.
I used the cursor method from the psycopg2 package in Python. The sample code is given below. You have to set all the Redshift credentials in the env_vars file.
You can set your queries using cursor.execute(). Here I mention one UPDATE query, so you can put your own query (or multiple queries) in its place. After that, you have to add this Python file to crontab or any other scheduler to run your queries periodically.
import psycopg2
import sys
import env_vars

conn_string = "dbname=%s port=%s user=%s password=%s host=%s " % (
    env_vars.RedshiftVariables.REDSHIFT_DW,
    env_vars.RedshiftVariables.REDSHIFT_PORT,
    env_vars.RedshiftVariables.REDSHIFT_USERNAME,
    env_vars.RedshiftVariables.REDSHIFT_PASSWORD,
    env_vars.RedshiftVariables.REDSHIFT_HOST)
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("""UPDATE database.demo_table SET Device_id = '123' WHERE Device = 'IPHONE' OR Device = 'Apple';""")
conn.commit()
conn.close()

How can I know if the connection to an existing database Neo4j has been successful?

I am a newbie in Python/Django. Using the neo4django library and an existing Neo4j database, I would like to connect to it and test whether the connection is successful. How can I achieve this?
You don't 'connect' to a database anymore; that is the framework's job. You just define the parameters and start writing models.
Those models are your entities, with fields that can be used like variables. In other words, your models are your definition of the database tables.
You can test against http://host:7474/db/data/ and check whether it returns a 200 OK.
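If you want that check in code, a minimal sketch of the HTTP probe (shown in Java here, though any HTTP client works; the host, port, and default /db/data/ path are assumptions for a standard local install) could be:
import java.net.HttpURLConnection;
import java.net.URL;

public class Neo4jPing {
    public static void main(String[] args) throws Exception {
        // Default Neo4j REST endpoint; adjust host/port for your deployment.
        URL url = new URL("http://localhost:7474/db/data/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();
        // Note: servers with authentication enabled may return 401 instead of 200.
        System.out.println(status == 200 ? "Connected to Neo4j" : "Unexpected status: " + status);
    }
}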
I don't know much about neo4django, but you can test if a database is accessible with py2neo, the general purpose Python driver (http://py2neo.org/2.0/).
One simple way is to ask the server for its version:
from py2neo import Graph
from py2neo.packages.httpstream import SocketError

# adjust as necessary
graph = Graph("http://localhost:7474/db/data/")

try:
    print(graph.neo4j_version)
except SocketError:
    print('No connection to database.')