I am trying the "Getting Started with Cloud Spanner in Python" guide for google cloud spanner.
I have created the instances databases e.t.c.
I have got to the "Create a database client" section.
We operate behind a firewall and have to set our proxy setting, we have done this successfully with Gsutil, BQ command line e.t.c.
When I set the proxy settings then try and execute the quickstart.py
I get error .
E0620 08:35:32.703000000 5020 src/core/ext/filters/client_channel/uri_parser.c:60] bad uri.scheme: 'xx.xxx.xxx.xxx:xx'
E0620 08:35:32.703000000 5020 src/core/ext/filters/client_channel/uri_parser.c:66] ^ here
E0620 08:35:32.703000000 5020 src/core/ext/filters/client_channel/http_proxy.c:56] cannot parse value of 'http_proxy' env var
It is at the line database.execute_sql('SELECT 1') where it all goes wrong.
If you have not seen the Quickstart example, here is the code.
#!/usr/bin/env python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
def run_quickstart():
# [START spanner_quickstart]
# Imports the Google Cloud Client Library.
from google.cloud import spanner
# Instantiate a client.
spanner_client = spanner.Client()
# Your Cloud Spanner instance ID.
instance_id = 'im-spanner'
# Get a Cloud Spanner instance by ID.
instance = spanner_client.instance(instance_id)
# Your Cloud Spanner database ID.
database_id = 'd42'
# Get a Cloud Spanner database by ID.
database = instance.database(database_id)
# Execute a simple SQL statement.
results = database.execute_sql('SELECT 1')
for row in results:
print(row)
# [END spanner_quickstart]
if __name__ == '__main__':
run_quickstart()
I have double checked the proxy details and they are correct.
Can anyone help ?
Have you set http_proxy variable to point to your proxy? Please see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md for information. If you use cloud libraries using gRPC to access through a Proxy, you need to set this variable to provide URI of Proxy to gRPC libraries.
Related
Need some support in building the cloud function that calls Oracle database, wrote the python code and it's on repo and the function calls it with an HTTP trigger, so that's good.
To connect to Oracle, Oracle Client Library is needed, and it's uploaded on cloud storage bucket.
So now the repo and bucket and the function are all set and in the same region, yet the function throws an error that it can't configure the oracle client library
Here is the code if it's important
import cx_Oracle
def queryOracleDatabase(request):
# Oracle Database Connection
username = 'x'
password = 'y'
connStr = '00.00.00.00:0000/abcd'
try:
conn = cx_Oracle.connect(username, password, connStr)
except cx_Oracle.DatabaseError as e:
error = e.args
print('Error: ', error.message)
return
# Execute the query
try:
cursor = conn.cursor()
cursor.execute('select * table')
data = cursor.fetchall()
except cx_Oracle.DatabaseError as e:
error = e.args
print('Error: ', error.message)
return
# Clean up
cursor.close()
conn.close()
return data
And this is the error it throws
Error: DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory"
How to connect the function with the bucket?
Given more context provided by the comments, Cloud Functions don't fit so well the use case since they don't provide a persistent disk where you may store your oracle client lib.
Cloud functions is a very specialized service, such specialization provides a very low adopting curve and makes it the best choice when the use case and tech stack fit such specialization (eg.: No need of FS unless /tmp, or no need to customize the runtime/OS).
Instead, when the use case does require some degree of customization of the container where the function runs, Cloud Run comes to life. By simply defining a docker container you can make it host the oracle client lib in the FS (everywhere you need), as well as running your function reusing your current code almost as is.
I presume your teck stack is quite standard, so I would check the docker hub for an image based on python and maybe even the oracle SDK you need.. It would be an easy starting point.
About accessing the oracle client hosted on a bucket: the cloud function might download it to the /temp storage, but I'm not sure that you can actually load the lib from there. Such approach of storing libraries to buckets is something unusual to me (just personal experience).
Very new in Google Cloud Platform & hence asking basic question.
I am looking for an API which will be hosted in GCP. An External application will call the API to read data from BigQuery.
Can anyone help me out with any example Code/Approach?
Looking for an End-to-End cloud based solution based on Python
I can't provide you with a complete code example. But:
You can setup your python API using (Flask for example)
You can then use the python client to connect to BigQuery https://cloud.google.com/bigquery/docs/reference/libraries
Deploy your python API in Google App Engine, Cloud Run, Kubernetes, Compute, etc....
Do not forget to setup CORS and potential auth,
That's it
You can create a Python program using the Bigquery client, then deploy this program as a HTTP Cloud Function or Cloud Run service :
from flask import escape
from google.cloud import bigquery
import functions_framework
#functions_framework.http
def your_http_function(request):
#HTTP Cloud Function.
request_json = request.get_json(silent=True)
request_args = request.args
# example to retrieve argument param in the HTTP call
if request_json and 'name' in request_json:
name = request_json['name']
elif request_args and 'name' in request_args:
name = request_args['name']
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name, state
ORDER BY total_people DESC
LIMIT 20
"""
query_job = client.query(query) # Make an API request.
rows = query_job.result() # Waits for query to finish
for row in rows:
print(row.name)
return rows
You have to deploy your Python code as a Cloud Function in this example
Your function can be invoked with a HTTP call with a param name :
https://GCP_REGION-PROJECT_ID.cloudfunctions.net/hello_http?name=NAME
You can also use Cloud Run that gives more flexibility because you deploy a Docker image.
I'm following the TFX on Cloud AI Platform Pipelines tutorial to implement a Kubeflow orchestrated pipeline on Google Cloud. The main difference is that I'm trying to implement an Object Detection solution instead of the Taxi application proposed by the tutorial.
For this reason I (locally) created a dataset of images labelled via labelImg and converted it to a .tfrecord using this script that I've uploaded on a GS bucket. Then I followed the TFX tutorial creating the GKE cluster (the default one, with this configuration) and the Jupyter Notebook needed to run the code, importing the same template.
The main difference is in the first component of the pipeline, where I changed the CSVExampleGen component to an ImportExampleGen one:
def create_pipeline(
pipeline_name: Text,
pipeline_root: Text,
data_path: Text,
# TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
# query: Text,
preprocessing_fn: Text,
run_fn: Text,
train_args: tfx.proto.TrainArgs,
eval_args: tfx.proto.EvalArgs,
eval_accuracy_threshold: float,
serving_model_dir: Text,
metadata_connection_config: Optional[
metadata_store_pb2.ConnectionConfig] = None,
beam_pipeline_args: Optional[List[Text]] = None,
ai_platform_training_args: Optional[Dict[Text, Text]] = None,
ai_platform_serving_args: Optional[Dict[Text, Any]] = None,
) -> tfx.dsl.Pipeline:
"""Implements the chicago taxi pipeline with TFX."""
components = []
# Brings data into the pipeline or otherwise joins/converts training data.
example_gen = tfx.components.ImportExampleGen(input_base=data_path)
# TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
# example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
# query=query)
components.append(example_gen)
No other components are inserted in the pipeline and the data path points to the location of the folder on the bucket containing the .tfrecord:
DATA_PATH = 'gs://(project bucket)/(dataset folder)'
This is the runner code (basically identical to the one of the TFX tutorial):
def run():
"""Define a kubeflow pipeline."""
# Metadata config. The defaults works work with the installation of
# KF Pipelines using Kubeflow. If installing KF Pipelines using the
# lightweight deployment option, you may need to override the defaults.
# If you use Kubeflow, metadata will be written to MySQL database inside
# Kubeflow cluster.
metadata_config = tfx.orchestration.experimental.get_default_kubeflow_metadata_config(
)
runner_config = tfx.orchestration.experimental.KubeflowDagRunnerConfig(
kubeflow_metadata_config=metadata_config,
tfx_image=configs.PIPELINE_IMAGE)
pod_labels = {
'add-pod-env': 'true',
tfx.orchestration.experimental.LABEL_KFP_SDK_ENV: 'tfx-template'
}
tfx.orchestration.experimental.KubeflowDagRunner(
config=runner_config, pod_labels_to_attach=pod_labels
).run(
pipeline.create_pipeline(
pipeline_name=configs.PIPELINE_NAME,
pipeline_root=PIPELINE_ROOT,
data_path=DATA_PATH,
# TODO(step 7): (Optional) Uncomment below to use BigQueryExampleGen.
# query=configs.BIG_QUERY_QUERY,
preprocessing_fn=configs.PREPROCESSING_FN,
run_fn=configs.RUN_FN,
train_args=tfx.proto.TrainArgs(num_steps=configs.TRAIN_NUM_STEPS),
eval_args=tfx.proto.EvalArgs(num_steps=configs.EVAL_NUM_STEPS),
eval_accuracy_threshold=configs.EVAL_ACCURACY_THRESHOLD,
serving_model_dir=SERVING_MODEL_DIR,
# TODO(step 7): (Optional) Uncomment below to use provide GCP related
# config for BigQuery with Beam DirectRunner.
# beam_pipeline_args=configs
# .BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS,
# TODO(step 8): (Optional) Uncomment below to use Dataflow.
# beam_pipeline_args=configs.DATAFLOW_BEAM_PIPELINE_ARGS,
# TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
# ai_platform_training_args=configs.GCP_AI_PLATFORM_TRAINING_ARGS,
# TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
# ai_platform_serving_args=configs.GCP_AI_PLATFORM_SERVING_ARGS,
))
if __name__ == '__main__':
logging.set_verbosity(logging.INFO)
run()
The pipeline is then created and a run is invoked with the following code from the Notebook:
!tfx pipeline create --pipeline-path=kubeflow_runner.py --endpoint={ENDPOINT} --build-image
!tfx run create --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}
The problem is that, while the pipeline from the example runs without problem, this pipeline always fails with the pod on the GKE cluster exiting with code 137 (OOMKilled).
This is a snapshot of the cluster workload status and this is a full log dump of the run that crashes.
I've already tried reducing the dataset size (it is now about 6MB for the whole .tfrecord) and splitting it locally in two sets (validation and training) since the crash seems to happen when the component should split the dataset, but neither of these changed the situation.
Do you have any idea on why it goes out of memory and what steps could I take to solve this?
Thank you very much.
If an application has a memory leak or tries to use more memory than a set limit amount, Kubernetes will terminate it with an “OOMKilled—Container limit reached” event and Exit Code 137.
When you see a message like this, you have two choices: increase the limit for the pod or start debugging. If, for example, your website was experiencing an increase in load, then adjusting the limit would make sense. On the other hand, if the memory use was sudden or unexpected, it may indicate a memory leak and you should start debugging immediately.
Remember, Kubernetes killing a pod like that is a good thing—it prevents all the other pods from running on the same node.
also refer similar issues link1 and link2,hope it helps.Thanks
There are great but still limited set of SDK access to GCP APIs within cloud function SDKs. e.g. Node.
I want to call gcloud cli within a cloud function. Is this possible? e.g.
gcloud sql instances patch my-database --activation-policy=NEVER
The goal is nightly shutdown of an SQL instance
I believe you should use the Cloud SQL Admin API. If you're using the Python runtime for example you'd had 'google-api-python-client==1.7.8' (for example) to your requirements file and on the respective client library you would use the method instances.patch with the appropriate parameters.
Hope this helps.
Also you have here a working example with the Python runtime, just be sure to edit the 'projid' and 'instance' variables accordingly.
from googleapiclient.discovery import build
service = build('sqladmin', 'v1beta4')
projid = '' #project id where Cloud SQL instance is
instance = '' #Cloud SQL instance
patch = {'settings': {'activationPolicy':'NEVER'}}
req = service.instances().patch(project=projid, instance=instance, body=patch)
x = req.execute()
print(x)
What ways do we have available to connect to a Google Cloud SQL (MySQL) instance from the newly introduced Google Cloud Composer? The intention is to get data from a Cloud SQL instance into BigQuery (perhaps with an intermediary step through Cloud Storage).
Can the Cloud SQL proxy be exposed in some way on pods part the Kubernetes cluster hosting Composer?
If not can the Cloud SQL Proxy be brought in by using the Kubernetes Service Broker? -> https://cloud.google.com/kubernetes-engine/docs/concepts/add-on/service-broker
Should Airflow be used to schedule and call GCP API commands like 1) export mysql table to cloud storage 2) read mysql export into bigquery?
Perhaps there are other methods that I am missing to get this done
"The Cloud SQL Proxy provides secure access to your Cloud SQL Second Generation instances without having to whitelist IP addresses or configure SSL." -Google CloudSQL-Proxy Docs
CloudSQL Proxy seems to be the recommended way to connect to CloudSQL above all others. So in Composer, as of release 1.6.1, we can create a new Kubernetes Pod to run the gcr.io/cloudsql-docker/gce-proxy:latest image, expose it through a service, then create a Connection in Composer to use in the operator.
To get set up:
Follow Google's documentation
Test the connection using info from Arik's Medium Post
Check that the pod was created kubectl get pods --all-namespaces
Check that the service was created kubectl get services --all-namespaces
Jump into a worker node kubectl --namespace=composer-1-6-1-airflow-1-10-1-<some-uid> exec -it airflow-worker-<some-uid> bash
Test mysql connection mysql -u composer -p --host <service-name>.default.svc.cluster.local
Notes:
Composer now uses namespaces to organize pods
Pods in different namespaces don't talk to each other unless you give them the full path <k8-service-name>.<k8-namespace-name>.svc.cluster.local
Creating a new Composer Connection with the full path will enable successful connection
We had the same problem but with a Postgres instance. This is what we did, and got it to work:
create a sqlproxy deployment in the Kubernetes cluster where airflow runs. This was a copy of the existing airflow-sqlproxy used by the default airflow_db connection with the following changes to the deployment file:
replace all instances of airflow-sqlproxy with the new proxy name
edit under 'spec: template: spec: containers: command: -instances', replace the existing instance name with the new instance we want to connect to
create a kubernetes service, again as a copy of the existing airflow-sqlproxy-service with the following changes:
replace all instances of airflow-sqlproxy with the new proxy name
under 'spec: ports', change to the appropriate port (we used 5432 for a Postgres instance)
in the airflow UI, add a connection of type Postgres with host set to the newly created service name.
You can follow these instructions to launch a new Cloud SQL proxy instance in the cluster.
re #3: That sounds like a good plan. There isn't a Cloud SQL to BigQuery operator to my knowledge, so you'd have to do it in two phases like you described.
Adding the medium post in the comments from #Leo to the top level https://medium.com/#ariklevliber/connecting-to-gcp-composer-tasks-to-cloud-sql-7566350c5f53 . Once you follow that article and have the service setup you can connect from your DAG using SQLAlchemy like this:
import os
from datetime import datetime, timedelta
import logging
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
logger = logging.getLogger(os.path.basename(__file__))
INSTANCE_CONNECTION_NAME = "phil-new:us-east1:phil-db"
default_args = {
'start_date': datetime(2019, 7, 16)
}
def connect_to_cloud_sql():
'''
Create a connection to CloudSQL
:return:
'''
import sqlalchemy
try:
PROXY_DB_URL = "mysql+pymysql://<user>:<password>#<cluster_ip>:3306/<dbname>"
logger.info("DB URL", PROXY_DB_URL)
engine = sqlalchemy.create_engine(PROXY_DB_URL, echo=True)
for result in engine.execute("SELECT NOW() as now"):
logger.info(dict(result))
except Exception:
logger.exception("Unable to interact with CloudSQL")
dag = DAG(
dag_id="example_sqlalchemy",
default_args=default_args,
# schedule_interval=timedelta(minutes=5),
catchup=False # If you don't set this then the dag will run according to start date
)
t1 = PythonOperator(
task_id="example_sqlalchemy",
python_callable=connect_to_cloud_sql,
dag=dag
)
if __name__ == "__main__":
connect_to_cloud_sql()
Here, in Hoffa's answer to a similar question, you can find a reference on how Wepay keeps it synchronized every 15 minutes using an Airflow operator.
From said answer:
Take a look at how WePay does this:
https://wecode.wepay.com/posts/bigquery-wepay
The MySQL to GCS operator executes a SELECT query against a MySQL
table. The SELECT pulls all data greater than (or equal to) the last
high watermark. The high watermark is either the primary key of the
table (if the table is append-only), or a modification timestamp
column (if the table receives updates). Again, the SELECT statement
also goes back a bit in time (or rows) to catch potentially dropped
rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL
database every 15 minutes.
Now we can connect to Cloud SQL without creating a cloud proxy ourselves. The operator will create it automatically. The code look like this:
from airflow.models import DAG
from airflow.contrib.operators.gcp_sql_operator import CloudSqlInstanceExportOperator
export_body = {
'exportContext': {
'fileType': 'CSV',
'uri': EXPORT_URI,
'databases': [DB_NAME],
'csvExportOptions': {
'selectQuery': SQL
}
}
}
default_dag_args = {}
with DAG(
'postgres_test',
schedule_interval='#once',
default_args=default_dag_args) as dag:
sql_export_task = CloudSqlInstanceExportOperator(
project_id=GCP_PROJECT_ID,
body=export_body,
instance=INSTANCE_NAME,
task_id='sql_export_task'
)