How to list workers logged onto head - ray

I'm setting up Ray on a kubernetes cluster.
I have started some workers and a head inside some pods.
Is there a way I can list the workers attached to the head, without writing a cluster config file?

Open a Python interpreter and run:
import ray
import pprint

# connect to the existing cluster via the head node's Redis address
ray.init(redis_address="<cluster head>:<port>")

# the client table lists every node (head and workers) registered with the head
gsct = ray.global_state.client_table()
pprint.pprint(gsct)
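In newer Ray releases the global-state API has been replaced; the same information is exposed as ray.nodes(). A minimal sketch, assuming a recent Ray version (the address and port are placeholders):
import ray
import pprint

# newer Ray versions take `address` instead of `redis_address`
ray.init(address="<cluster head>:<port>")

# each entry describes one node in the cluster (the head and every attached worker)
pprint.pprint(ray.nodes())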

Related

How to connect Janusgraph deployed in GCP with Python runtime?

I have deployed JanusGraph using Helm in Google Cloud containers, following this documentation:
https://cloud.google.com/architecture/running-janusgraph-with-bigtable
I'm able to run Gremlin queries from Google Cloud Shell.
(Snapshot of Google Cloud Shell omitted.)
Now I want to access JanusGraph using Python. I tried the code below, but it's unable to connect to JanusGraph inside the GCP container.
from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
graph = Graph()
g = graph.traversal().withRemote(DriverRemoteConnection('gs://127.0.0.1:8182/gremlin','g'))
value = g.V().has('name','hercules').values('age')
print(value)
Here's the output I'm getting:
[['V'], ['has', 'name', 'hercules'], ['values', 'age']]
Whereas the output should be:
30
Has anyone tried to access JanusGraph using Python inside GCP?
You need to end the query with a terminal step such as next() or toList(). What you are seeing is the traversal's bytecode, printed because the query was never submitted to the server due to the missing terminal step. So you need something like this:
value = g.V().has('name','hercules').values('age').next()
print(value)
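For reference, a minimal end-to-end sketch, assuming the Gremlin Server endpoint is reachable over WebSocket (the host below is a placeholder):
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# placeholder endpoint; Gremlin Server normally listens on ws://<host>:8182/gremlin
conn = DriverRemoteConnection('ws://<gremlin-server-host>:8182/gremlin', 'g')
g = Graph().traversal().withRemote(conn)

# next() (or toList()) submits the traversal to the server and returns the result
age = g.V().has('name', 'hercules').values('age').next()
print(age)

conn.close()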

How to transfer files and directories from remote server to GCS buckets

I'm stuck. I have to import files and directories from a remote server (probably Linux) to GCS buckets on a given schedule (let's say weekly).
There are several folders on the server, one for each project. Inside each project folder there are folders named after dates (20221015, 20221019, ...). Inside each of these folders there are other subfolders.
I have to migrate an entire folder, starting from the date-folder, every week.
First question: which Google service is better for my purpose, Storage Transfer Service or Cloud Composer (is there an operator for this task)?
Second question: how can I avoid importing date-folders that I have already imported (i.e. one week ago)?
Thank you for your help
Using Cloud Composer, you may use SFTPToGCSOperator.
Here is a working DAG on my end:
import os
from airflow import models
from airflow.models import Variable
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator
from airflow.utils.dates import days_ago

default_args = {"start_date": days_ago(1)}

DIR_path = "<your-path>"
BUCKET_SRC = "<your-bucket>"

with models.DAG(
    "dag_sftp_to_gcs", default_args=default_args, schedule_interval=None
) as dag:
    copy_sftp_to_gcs = SFTPToGCSOperator(
        task_id="t_sftp_to_gcs",
        sftp_conn_id="<your-connection>",
        gcp_conn_id="google_cloud_default",
        source_path=os.path.join(DIR_path, "*.xlsx"),  # you may use a wildcard to get multiple files
        destination_bucket=BUCKET_SRC,
    )

    copy_sftp_to_gcs
Remember though that you need to allow ssh in your firewall settings. You may also follow this SO post on how to create connections in Airflow.
(Screenshots omitted: the file on my VM, the task status in the Airflow console, and the transferred file in GCS.)
For your second question, you may create an internal shell script on your server that moves or deletes the date-folders you have already transferred, so that only the correct files are fetched by Airflow.
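Alternatively, a minimal sketch of doing that bookkeeping inside the DAG itself, assuming the SFTP and GCS provider hooks and the placeholder connection ids used above (parameter names can vary slightly between provider versions):
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.sftp.hooks.sftp import SFTPHook

def find_new_date_folders(project_path, bucket_name):
    """Return the date-folders present on the server but not yet in the bucket."""
    sftp = SFTPHook(ssh_conn_id="<your-connection>")
    gcs = GCSHook(gcp_conn_id="google_cloud_default")

    remote_dates = sftp.list_directory(project_path)            # e.g. ['20221015', '20221019', ...]
    copied = {blob.split("/")[0] for blob in gcs.list(bucket_name)}

    return [d for d in remote_dates if d not in copied]
The returned list could then drive one SFTPToGCSOperator per new date-folder.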

Dataflow: storage.Client() import error or how to list all blobs of a GCP bucket

I have an apache-beam==2.3.0 pipeline written using the Python SDK that works with the DirectRunner locally. When I change the runner to DataflowRunner I get an error about 'storage' not being global.
Checking my code, I think it's because I am using the credentials stored in my environment. In my Python code I just do:
class ExtractBlobs(beam.DoFn):
    def process(self, element):
        from google.cloud import storage
        client = storage.Client()
        yield list(client.get_bucket(element).list_blobs(max_results=100))
The real issue is that I need the client so I can get the bucket, and the bucket so I can list the blobs. Everything I'm doing here is so I can list the blobs.
So could anyone either point me in the right direction for using storage.Client() in Dataflow, or show me how to list the blobs of a GCS bucket without needing the client?
Thanks in advance!
[+] What I've read: https://cloud.google.com/docs/authentication/production#auth-cloud-implicit-python
Fixed:
Okay, so upon further reading and investigating, it turns out I had the required libraries to run my pipeline locally, but Dataflow needs to be told about them so it can download them onto the workers it spins up: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
So all I've done is create a requirements.txt file with my google-cloud-* requirements.
I then launch my pipeline like this:
python myPipeline.py --requirements_file requirements.txt --save_main_session True
The last flag tells Beam to pickle the main session so that the imports you do in __main__ are available on the workers.
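Equivalently, a minimal sketch of setting the same options programmatically rather than on the command line (the project and bucket names are placeholders):
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="<your-project>",
    temp_location="gs://<your-bucket>/tmp",
    requirements_file="requirements.txt",   # same as --requirements_file
)
# same as --save_main_session: ship the __main__ globals/imports to the workers
options.view_as(SetupOptions).save_main_session = True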

Google Cloud Composer and Google Cloud SQL

What ways do we have available to connect to a Google Cloud SQL (MySQL) instance from the newly introduced Google Cloud Composer? The intention is to get data from a Cloud SQL instance into BigQuery (perhaps with an intermediary step through Cloud Storage).
Can the Cloud SQL Proxy be exposed in some way on pods that are part of the Kubernetes cluster hosting Composer?
If not can the Cloud SQL Proxy be brought in by using the Kubernetes Service Broker? -> https://cloud.google.com/kubernetes-engine/docs/concepts/add-on/service-broker
Should Airflow be used to schedule and call GCP API commands, like 1) export the MySQL table to Cloud Storage and 2) read the MySQL export into BigQuery?
Perhaps there are other methods that I am missing to get this done.
"The Cloud SQL Proxy provides secure access to your Cloud SQL Second Generation instances without having to whitelist IP addresses or configure SSL." -Google CloudSQL-Proxy Docs
CloudSQL Proxy seems to be the recommended way to connect to CloudSQL above all others. So in Composer, as of release 1.6.1, we can create a new Kubernetes Pod to run the gcr.io/cloudsql-docker/gce-proxy:latest image, expose it through a service, then create a Connection in Composer to use in the operator.
To get set up:
Follow Google's documentation
Test the connection using info from Arik's Medium post
Check that the pod was created: kubectl get pods --all-namespaces
Check that the service was created: kubectl get services --all-namespaces
Jump into a worker node: kubectl --namespace=composer-1-6-1-airflow-1-10-1-<some-uid> exec -it airflow-worker-<some-uid> bash
Test the mysql connection: mysql -u composer -p --host <service-name>.default.svc.cluster.local
Notes:
Composer now uses namespaces to organize pods
Pods in different namespaces don't talk to each other unless you give them the full path <k8-service-name>.<k8-namespace-name>.svc.cluster.local
Creating a new Composer Connection with the full path will enable a successful connection
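Once that Connection exists, a minimal sketch of using it from a task, assuming a hypothetical connection id "cloudsql_mysql" (the import path is the Airflow 1.10 one; newer versions use airflow.providers.mysql.hooks.mysql):
from airflow.hooks.mysql_hook import MySqlHook

def read_from_cloud_sql():
    # "cloudsql_mysql" is a hypothetical Connection whose host is
    # <k8-service-name>.<k8-namespace-name>.svc.cluster.local
    hook = MySqlHook(mysql_conn_id="cloudsql_mysql")
    return hook.get_records("SELECT NOW()")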
We had the same problem but with a Postgres instance. This is what we did, and got it to work:
Create a sqlproxy deployment in the Kubernetes cluster where Airflow runs. This was a copy of the existing airflow-sqlproxy used by the default airflow_db connection, with the following changes to the deployment file:
replace all instances of airflow-sqlproxy with the new proxy name
under 'spec: template: spec: containers: command: -instances', replace the existing instance name with the new instance we want to connect to
Create a Kubernetes service, again as a copy of the existing airflow-sqlproxy-service, with the following changes:
replace all instances of airflow-sqlproxy with the new proxy name
under 'spec: ports', change to the appropriate port (we used 5432 for a Postgres instance)
In the Airflow UI, add a connection of type Postgres with the host set to the newly created service name.
You can follow these instructions to launch a new Cloud SQL proxy instance in the cluster.
re #3: That sounds like a good plan. There isn't a Cloud SQL to BigQuery operator to my knowledge, so you'd have to do it in two phases like you described.
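For reference, a minimal sketch of that two-phase approach with the Airflow 1.10 contrib operators (connection ids, bucket, dataset and table names are placeholders; newer Airflow versions expose these operators under the Google provider package):
# assumed to live inside a `with DAG(...) as dag:` block, like the examples below
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

extract = MySqlToGoogleCloudStorageOperator(
    task_id="mysql_to_gcs",
    mysql_conn_id="cloudsql_mysql",                  # the proxy-backed connection
    sql="SELECT * FROM my_table",
    bucket="<your-bucket>",
    filename="exports/my_table_{}.json",             # {} is filled in per exported chunk
    schema_filename="exports/my_table_schema.json",  # BigQuery schema written alongside the data
)

load = GoogleCloudStorageToBigQueryOperator(
    task_id="gcs_to_bq",
    bucket="<your-bucket>",
    source_objects=["exports/my_table_*.json"],
    schema_object="exports/my_table_schema.json",
    source_format="NEWLINE_DELIMITED_JSON",
    destination_project_dataset_table="<project>.<dataset>.my_table",
    write_disposition="WRITE_TRUNCATE",
)

extract >> load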
Adding the Medium post from the comments by @Leo to the top level: https://medium.com/@ariklevliber/connecting-to-gcp-composer-tasks-to-cloud-sql-7566350c5f53. Once you follow that article and have the service set up, you can connect from your DAG using SQLAlchemy like this:
import os
from datetime import datetime, timedelta
import logging

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

logger = logging.getLogger(os.path.basename(__file__))

INSTANCE_CONNECTION_NAME = "phil-new:us-east1:phil-db"

default_args = {
    'start_date': datetime(2019, 7, 16)
}


def connect_to_cloud_sql():
    '''
    Create a connection to CloudSQL
    :return:
    '''
    import sqlalchemy
    try:
        PROXY_DB_URL = "mysql+pymysql://<user>:<password>@<cluster_ip>:3306/<dbname>"
        logger.info("DB URL: %s", PROXY_DB_URL)
        engine = sqlalchemy.create_engine(PROXY_DB_URL, echo=True)
        for result in engine.execute("SELECT NOW() as now"):
            logger.info(dict(result))
    except Exception:
        logger.exception("Unable to interact with CloudSQL")


dag = DAG(
    dag_id="example_sqlalchemy",
    default_args=default_args,
    # schedule_interval=timedelta(minutes=5),
    catchup=False  # If you don't set this then the dag will run according to start date
)

t1 = PythonOperator(
    task_id="example_sqlalchemy",
    python_callable=connect_to_cloud_sql,
    dag=dag
)

if __name__ == "__main__":
    connect_to_cloud_sql()
Here, in Hoffa's answer to a similar question, you can find a reference on how WePay keeps it synchronized every 15 minutes using an Airflow operator.
From said answer:
Take a look at how WePay does this:
https://wecode.wepay.com/posts/bigquery-wepay
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
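As a concrete illustration of the high-watermark query described in that quote (the table and column names are hypothetical):
# the watermark would normally be read from the previous run's state (e.g. an Airflow Variable)
last_watermark = "2019-07-15 00:00:00"

# pull everything at or after the watermark, stepping back an hour to catch
# rows that arrived late, as the quoted post suggests
sql = (
    "SELECT * FROM my_table "
    "WHERE updated_at >= TIMESTAMP('{}') - INTERVAL 1 HOUR".format(last_watermark)
)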
Now we can connect to Cloud SQL without creating the Cloud SQL proxy ourselves; the operator will create it automatically. The code looks like this:
from airflow.models import DAG
from airflow.contrib.operators.gcp_sql_operator import CloudSqlInstanceExportOperator

export_body = {
    'exportContext': {
        'fileType': 'CSV',
        'uri': EXPORT_URI,
        'databases': [DB_NAME],
        'csvExportOptions': {
            'selectQuery': SQL
        }
    }
}

default_dag_args = {}

with DAG(
        'postgres_test',
        schedule_interval='@once',
        default_args=default_dag_args) as dag:

    sql_export_task = CloudSqlInstanceExportOperator(
        project_id=GCP_PROJECT_ID,
        body=export_body,
        instance=INSTANCE_NAME,
        task_id='sql_export_task'
    )

PySpark print to console

When running a PySpark job on Dataproc like this
gcloud --project <project_name> dataproc jobs submit pyspark --cluster <cluster_name> <python_script>
my print statements don't show up in my terminal.
Is there any way to output data onto the terminal in PySpark when running jobs on the cloud?
Edit: I would like to print/log info from within my transformation. For example:
def print_funct(l):
    print(l)
    return l

rddData.map(lambda l: print_funct(l)).collect()
This should print every line of data in the RDD rddData.
Doing some digging, I found this answer about logging; however, testing it gave me the results of this question, whose answer states that logging isn't possible within the transformation.
Printing or logging inside of a transform will end up in the Spark executor logs, which can be accessed through your Application's AppMaster or HistoryServer via the YARN ResourceManager Web UI.
You could alternatively collect the information you are printing alongside your output (e.g. in a dict or tuple). You could also stash it away in an accumulator and then print it from the driver.
If you are doing a lot of print statement debugging, you might find it faster to SSH into your master node and use the pyspark REPL or IPython to experiment with your code. This would also allow you to use the --master local flag which would make your print statements appear in stdout.
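For the accumulator approach mentioned above, a minimal sketch (sc and rddData are assumed to already exist in your job):
# count (or otherwise summarize) what the transform sees, then print on the driver
seen = sc.accumulator(0)

def tag_and_count(line):
    seen.add(1)        # executor-side update; accumulators are safe for counters
    return line

result = rddData.map(tag_and_count).collect()
print("processed %d lines" % seen.value)   # runs on the driver, so it shows in your terminal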