Send data from Cloud sql MSSQL to Bigquery using dataflow - google-cloud-platform

I want to create a flex template in dataflow to send CDC data from MSSQL to bigquery, I was able to connect using pyodbc but I see that there is a library from apache_beam.io.jdbc import ReadFromJdbc that when using it gives me an error.
Using Pyodbc can I use a ParDo to create the transformations? what would the structure look like? from the pipeline?
thanks
import pyodbc
connection = pyodbc.connect('DRIVER='+driver+';SERVER='+
self.source.host+';PORT='+ str(int(self.source.port)) +
';DATABASE='+self.source.database+';UID='+
self.source.username+';PWD='+self.source.password)
I have reviewed the following doc:
https://www.case-k.jp/entry/2022/01/06/094509
https://docs.devart.com/odbc/bigquery/python.htm
I'm doing everything by private ip
import pyodbc
connection = pyodbc.connect('DRIVER='+driver+';SERVER='+
self.source.host+';PORT='+ str(int(self.source.port)) +
';DATABASE='+self.source.database+';UID='+
self.source.username+';PWD='+self.source.password)

Related

How to connect Janusgraph deployed in GCP with Python runtime?

I have deployed the Janusgraph using Helm in Google cloud Containers, following the below documentation:
https://cloud.google.com/architecture/running-janusgraph-with-bigtable,
I'm able to fire the gremline query using Google Cloud Shell.
Snapshot of GoogleCLoud Shell
Now I want to access the Janusgraph using Python, I tried below line of code but it's unable to connect to Janusgraph inside GCP container.
from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
graph = Graph()
g = graph.traversal().withRemote(DriverRemoteConnection('gs://127.0.0.1:8182/gremlin','g'))
value = g.V().has('name','hercules').values('age')
print(value)
here's the output I'm getting
[['V'], ['has', 'name', 'hercules'], ['values', 'age']]
Whereas the output should be -
30
Is there someone tried to access Janusgraph using Python inside GCP.
You need to end the query with a terminal step such as next or toList. What you are seeing is the query bytecode printed as the query was never submitted to the server due to the missing terminal step. So you need something like this:
value = g.V().has('name','hercules').values('age').next()
print(value)

Google cloud functions + google cloud sql + google cloud scheduler timeout

I am trying to deploy a python code with google cloud funtions and scheduler that writes a simple table to a google postgresql cloud database.
I created a postgre sql database
Added a Cloud SQL client role to the App Engine default service account
Created a Cloud Pub/Sup topic
Created a scheduler job.
So far so good. Then I created the following function main.py :
from sqlalchemy import create_engine
import pandas as pd
from google.cloud.sql.connector import Connector
import sqlalchemy
connector = Connector()
def getconn():
conn = connector.connect(
INSTANCE_CONNECTION_NAME,
"pg8000",
user=DB_USER,
password=DB_PASS,
db=DB_NAME
)
return conn
pool = sqlalchemy.create_engine(
"postgresql+pg8000://",
creator=getconn,
)
def testdf(event,context):
df=pd.DataFrame({"a" : [1,2,3,4,5],
"b" : [1,2,3,4,5]})
df.to_sql("test",
con = pool,
if_exists ="replace",
schema = 'myschema')
And the requirements.txt contains:
pandas
sqlalchemy
pg8000
cloud-sql-python-connector[pg8000]
When I test the function it always timeout. No error, just these logs:
I cant figure it out why. I have tried several code snippets from:
https://colab.research.google.com/github/GoogleCloudPlatform/cloud-sql-python-connector/blob/main/samples/notebooks/postgres_python_connector.ipynb#scrollTo=UzHaM-6TXO8h
and from
https://codelabs.developers.google.com/codelabs/connecting-to-cloud-sql-with-cloud-functions#2
I think the adjustments of the permissions and roles cause the timeout. Any ideas?
Thx

Utilizing Apache Calcite without connecting to DB

I am trying to generate SQL query from relational algebra in Calcite without connecting to the database
I saw an example at Calcite website where a JDBC adapter uses the DB connection for the first time and all subsequent calls get data from cache. What I am looking for is not connecting to the DB at all, just do a translation from relational to SQL.
Any hint is much appreciated.
An example available in this notebook. I've included the relevant code below. You will need to create a schema instance for this to work. In this case, I've just used one of Calcite's test schemas.
import org.apache.calcite.jdbc.CalciteSchema;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelWriter;
import org.apache.calcite.rel.externalize.RelWriterImpl;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.test.CalciteAssert;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.RelBuilder;
SchemaPlus rootSchema = CalciteSchema.createRootSchema(true).plus();
FrameworkConfig config = Frameworks.newConfigBuilder()
.defaultSchema(
CalciteAssert.addSchema(rootSchema, CalciteAssert.SchemaSpec.HR))
.build();
RelBuilder builder = RelBuilder.create(config);
RelNode opTree = builder.scan("emps")
.scan("depts")
.join(JoinRelType.INNER, "deptno")
.filter(builder.equals(builder.field("empid"), builder.literal(100)))
.build();
The other step of converting to SQL is fairly straightforward by constructing an instance of RelToSqlConverter and calling it's visit method on the RelNode object.
You can use the new Quidem tests to run which would be much easier

Google Cloud Composer and Google Cloud SQL

What ways do we have available to connect to a Google Cloud SQL (MySQL) instance from the newly introduced Google Cloud Composer? The intention is to get data from a Cloud SQL instance into BigQuery (perhaps with an intermediary step through Cloud Storage).
Can the Cloud SQL proxy be exposed in some way on pods part the Kubernetes cluster hosting Composer?
If not can the Cloud SQL Proxy be brought in by using the Kubernetes Service Broker? -> https://cloud.google.com/kubernetes-engine/docs/concepts/add-on/service-broker
Should Airflow be used to schedule and call GCP API commands like 1) export mysql table to cloud storage 2) read mysql export into bigquery?
Perhaps there are other methods that I am missing to get this done
"The Cloud SQL Proxy provides secure access to your Cloud SQL Second Generation instances without having to whitelist IP addresses or configure SSL." -Google CloudSQL-Proxy Docs
CloudSQL Proxy seems to be the recommended way to connect to CloudSQL above all others. So in Composer, as of release 1.6.1, we can create a new Kubernetes Pod to run the gcr.io/cloudsql-docker/gce-proxy:latest image, expose it through a service, then create a Connection in Composer to use in the operator.
To get set up:
Follow Google's documentation
Test the connection using info from Arik's Medium Post
Check that the pod was created kubectl get pods --all-namespaces
Check that the service was created kubectl get services --all-namespaces
Jump into a worker node kubectl --namespace=composer-1-6-1-airflow-1-10-1-<some-uid> exec -it airflow-worker-<some-uid> bash
Test mysql connection mysql -u composer -p --host <service-name>.default.svc.cluster.local
Notes:
Composer now uses namespaces to organize pods
Pods in different namespaces don't talk to each other unless you give them the full path <k8-service-name>.<k8-namespace-name>.svc.cluster.local
Creating a new Composer Connection with the full path will enable successful connection
We had the same problem but with a Postgres instance. This is what we did, and got it to work:
create a sqlproxy deployment in the Kubernetes cluster where airflow runs. This was a copy of the existing airflow-sqlproxy used by the default airflow_db connection with the following changes to the deployment file:
replace all instances of airflow-sqlproxy with the new proxy name
edit under 'spec: template: spec: containers: command: -instances', replace the existing instance name with the new instance we want to connect to
create a kubernetes service, again as a copy of the existing airflow-sqlproxy-service with the following changes:
replace all instances of airflow-sqlproxy with the new proxy name
under 'spec: ports', change to the appropriate port (we used 5432 for a Postgres instance)
in the airflow UI, add a connection of type Postgres with host set to the newly created service name.
You can follow these instructions to launch a new Cloud SQL proxy instance in the cluster.
re #3: That sounds like a good plan. There isn't a Cloud SQL to BigQuery operator to my knowledge, so you'd have to do it in two phases like you described.
Adding the medium post in the comments from #Leo to the top level https://medium.com/#ariklevliber/connecting-to-gcp-composer-tasks-to-cloud-sql-7566350c5f53 . Once you follow that article and have the service setup you can connect from your DAG using SQLAlchemy like this:
import os
from datetime import datetime, timedelta
import logging
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
logger = logging.getLogger(os.path.basename(__file__))
INSTANCE_CONNECTION_NAME = "phil-new:us-east1:phil-db"
default_args = {
'start_date': datetime(2019, 7, 16)
}
def connect_to_cloud_sql():
'''
Create a connection to CloudSQL
:return:
'''
import sqlalchemy
try:
PROXY_DB_URL = "mysql+pymysql://<user>:<password>#<cluster_ip>:3306/<dbname>"
logger.info("DB URL", PROXY_DB_URL)
engine = sqlalchemy.create_engine(PROXY_DB_URL, echo=True)
for result in engine.execute("SELECT NOW() as now"):
logger.info(dict(result))
except Exception:
logger.exception("Unable to interact with CloudSQL")
dag = DAG(
dag_id="example_sqlalchemy",
default_args=default_args,
# schedule_interval=timedelta(minutes=5),
catchup=False # If you don't set this then the dag will run according to start date
)
t1 = PythonOperator(
task_id="example_sqlalchemy",
python_callable=connect_to_cloud_sql,
dag=dag
)
if __name__ == "__main__":
connect_to_cloud_sql()
Here, in Hoffa's answer to a similar question, you can find a reference on how Wepay keeps it synchronized every 15 minutes using an Airflow operator.
From said answer:
Take a look at how WePay does this:
https://wecode.wepay.com/posts/bigquery-wepay
The MySQL to GCS operator executes a SELECT query against a MySQL
table. The SELECT pulls all data greater than (or equal to) the last
high watermark. The high watermark is either the primary key of the
table (if the table is append-only), or a modification timestamp
column (if the table receives updates). Again, the SELECT statement
also goes back a bit in time (or rows) to catch potentially dropped
rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL
database every 15 minutes.
Now we can connect to Cloud SQL without creating a cloud proxy ourselves. The operator will create it automatically. The code look like this:
from airflow.models import DAG
from airflow.contrib.operators.gcp_sql_operator import CloudSqlInstanceExportOperator
export_body = {
'exportContext': {
'fileType': 'CSV',
'uri': EXPORT_URI,
'databases': [DB_NAME],
'csvExportOptions': {
'selectQuery': SQL
}
}
}
default_dag_args = {}
with DAG(
'postgres_test',
schedule_interval='#once',
default_args=default_dag_args) as dag:
sql_export_task = CloudSqlInstanceExportOperator(
project_id=GCP_PROJECT_ID,
body=export_body,
instance=INSTANCE_NAME,
task_id='sql_export_task'
)

Run Redshift Queries Periodically

I have started researching into Redshift. It is defined as a "Database" service in AWS. From what I have learnt so far, we can create tables and ingest data from S3 or from external sources like Hive into Redhshift database (cluster). Also, we can use JDBC connection to query these tables.
My questions are -
Is there a place within Redshift cluster where we can store our queries run it periodically (like Daily)?
Can we store our query in a S3 location and use that to create output to another S3 location?
Can we load a DB2 table unload file with a mixture of binary and string fields to Redshift directly, or do we need a intermediate process to make the data into something like a CSV?
I have done some Googling about this. If you have link to resources, that will be very helpful. Thank you.
I used cursor method using psycopg2 function in python. The sample code is given below. You have to set all the redshift credentials in env_vars files.
you can set your queries using cursor.execute. here I mension one update query so you can set your query in this place (you can set multiple queries). After that you have to set this python file into crontab or any other autorun application for running your queries periodically.
import psycopg2
import sys
import env_vars
conn_string = "dbname=%s port=%s user=%s password=%s host=%s " %(env_vars.RedshiftVariables.REDSHIFT_DW ,env_vars.RedshiftVariables.REDSHIFT_PORT ,env_vars.RedshiftVariables.REDSHIFT_USERNAME ,env_vars.RedshiftVariables.REDSHIFT_PASSWORD,env_vars.RedshiftVariables.REDSHIFT_HOST)
conn = psycopg2.connect(conn_string);
cursor = conn.cursor();
cursor.execute("""UPDATE database.demo_table SET Device_id = '123' where Device = 'IPHONE' or Device = 'Apple'; """);
conn.commit();
conn.close();