I have a requirement wherein I need to pass parameters to an Airflow DAG (e.g. tenantName), and the DAG (a PythonOperator) will use the tenantName to filter data based on the parameter passed.
I'm able to access parameters stored as Airflow Variables using the following code:
import datetime
from airflow import models

def greeting():
    import logging
    logging.info("Hello there, World! {} {}".format(models.Variable.get('tenantName'), datetime.datetime.now()))
However, for my requirement the tenantName will differ from run to run.
How do I achieve this?
I am trying to run a Custom Training Job to deploy my model in Vertex AI directly from JupyterLab. This JupyterLab is instantiated from a Vertex AI Managed Notebook where I already specified the service account.
My aim is to deploy the training script that I specify to the CustomTrainingJob method directly from the cells of my notebook. This would be equivalent to pushing an image that contains my script to Container Registry and deploying the training job manually from the Vertex AI UI (in that way, by specifying the service account, I was able to correctly deploy the training job). However, I need everything to be executed from the same notebook.
In order to specify the credentials to the CustomTrainingJob of aiplatform, I execute the following cell, where all variables are correctly set:
import google.auth
from google.cloud import aiplatform
from google.auth import impersonated_credentials

source_credentials = google.auth.default()
target_credentials = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal='SERVICE_ACCOUNT.iam.gserviceaccount.com',
    target_scopes=['https://www.googleapis.com/auth/cloud-platform'])

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path=SCRIPT_PATH,
    container_uri=MODEL_TRAINING_IMAGE,
    credentials=target_credentials
)
After the job.run() command is executed, it seems that the credentials are not correctly set. In particular, the following error is returned:
/opt/conda/lib/python3.7/site-packages/google/auth/impersonated_credentials.py in _update_token(self, request)
254
255 # Refresh our source credentials if it is not valid.
--> 256 if not self._source_credentials.valid:
257 self._source_credentials.refresh(request)
258
AttributeError: 'tuple' object has no attribute 'valid'
I also tried different ways to configure the credentials of my service account, but none of them seem to work. In this case it looks like the tuple that contains the source credentials is missing the 'valid' attribute, even though the method google.auth.default() only returns two values.
To run the custom training job using a service account, you could try using the service_account argument for job.run(), instead of trying to set credentials. As long as the notebook executes as a user that has act-as permissions for the chosen service account, this should let you run the custom training job as that service account.
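For illustration, here is a minimal sketch of that approach, assuming the same variables as in the question are defined (the service account email and replica count are placeholders):

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path=SCRIPT_PATH,
    container_uri=MODEL_TRAINING_IMAGE,
)

# Run the training job as the chosen service account instead of passing
# impersonated credentials to the job object.
job.run(
    service_account='SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com',
    replica_count=1,
)

With this, the notebook's own identity only needs act-as permission (iam.serviceAccounts.actAs) on that service account.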
I have an Airflow DAG calling a GKEPodOperator where a Python script is run to process and load data to BigQuery. Can the GKEPodOperator return something like a Python list or dictionary from the executed script back to the DAG, so I can use that to write custom emails using DAGOperator?
First, GKEPodOperator is deprecated. You should use GKEStartPodOperator.
Like other operators, you can pass values between tasks with XCom. Note that GKEStartPodOperator inherits from KubernetesPodOperator, so its XCom mechanism is different from that of other operators: it works by launching a sidecar container. You can read more about it, with examples, here.
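As a rough sketch of the push side (parameter names assume a recent google provider; in some older versions the flag was called xcom_push): enable do_xcom_push on the operator and have the script running inside the pod write its result as JSON to /airflow/xcom/return.json, which the sidecar picks up. The project, cluster and image values below are placeholders.

from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

load_to_bq = GKEStartPodOperator(
    task_id='gke_pod_task',
    project_id='my-project',           # placeholder
    location='us-central1',            # placeholder
    cluster_name='my-cluster',         # placeholder
    name='load-to-bq',
    namespace='default',
    image='gcr.io/my-project/loader',  # placeholder image containing your script
    do_xcom_push=True,                 # launch the XCom sidecar container
    # attach to your DAG as usual (dag=dag or inside `with DAG(...)`)
)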
Now that you have the desired value stored as an XCom (as a string), you want to pull it so you can use it in your custom operator:
Airflow >= 2.1.0 (not yet released)
There is native support for converting an XCom directly into a native Python object. See the PR. You will need to set render_template_as_native_obj=True. You can read more about it in the docs: Rendering Fields as Native Python Objects.
Airflow < 2.1.0:
In the follow-up task you need to pull the XCom value and convert it to whatever Python object you'd like. You can see an example of it here.
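For example, a minimal sketch of that follow-up task (the task IDs are illustrative and the import path assumes Airflow 2.x):

import json

from airflow.operators.python import PythonOperator

def build_email(**context):
    # Pull the value pushed by the pod's XCom sidecar.
    raw = context['ti'].xcom_pull(task_ids='gke_pod_task')
    # Depending on the Airflow version it may come back as a JSON string
    # or as an already-deserialized object, so convert defensively.
    payload = json.loads(raw) if isinstance(raw, str) else raw
    # Use payload (e.g. a list or dict) to build the custom email body.
    return payload

build_email_task = PythonOperator(
    task_id='build_email',
    python_callable=build_email,
    # attach to your DAG as usual (dag=dag or inside `with DAG(...)`)
)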
I'm new to GCP and Python. I have been given a task to import a JSON file into Google Firestore using Google Cloud Functions with Python.
Kindly assist, please.
I was able to achieve this setup using the code below. Posting it for your reference:
CLOUD FUNCTIONS CODE
REQUIREMENTS.TXT (Dependencies)
google-api-python-client==1.7.11
google-auth==1.6.3
google-auth-httplib2==0.0.3
google-cloud-storage==1.19.1
google-cloud-firestore==1.6.2
MAIN.PY
from google.cloud import storage
from google.cloud import firestore
import json

client = storage.Client()

def hello_gcs_generic(data, context):
    print('Bucket: {}'.format(data['bucket']))
    print('File: {}'.format(data['name']))
    bucketValue = data['bucket']
    filename = data['name']
    print('bucketValue : ', bucketValue)
    print('filename : ', filename)
    testFile = client.get_bucket(data['bucket']).blob(data['name'])
    dataValue = json.loads(testFile.download_as_string(client=None))
    print(dataValue)
    db = firestore.Client()
    doc_ref = db.collection(u'collectionName').document(u'documentName')
    doc_ref.set(dataValue)
Cloud Functions are serverless functions provided by Google. The beauty of a Cloud Function is that it is invoked when the attached trigger fires and destroys itself once the execution is complete. Cloud Functions are single-purpose functions. Not only Python, you can also use Node.js and Go to write Cloud Functions. You can create a Cloud Function very easily by visiting the Cloud Functions quickstart (https://cloud.google.com/functions/docs/quickstart-console).
Your task is to import a JSON file into Google Firestore. This part you can do using the Firestore Python client, just like in any normal Python program, and then add the code in the Cloud Functions console or upload it via gcloud. Still, the trigger part is missing here. As I mentioned, a Cloud Function is serverless; it executes when any event happens on the attached trigger. You haven't mentioned any trigger here (i.e. when you want the function to run). Once you give information about the trigger, I can give more insights on a resolution.
Ishank Aggarwal, you can add the above code snippet as part of a Cloud Function by following these steps:
https://console.cloud.google.com/functions/
Create a function with your function name and requirements, choose the runtime as Python, and choose the trigger as your GCS bucket.
Once you create it, whenever any change happens in your bucket, the function will trigger and execute your code.
I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything yet regarding scheduling an export from a BigQuery table to Cloud Storage.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export from the BigQuery table to Cloud Storage. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.
Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn get triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)

    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish for the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL as the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and as HTTP method choose GET.
Once created, and by pressing the RUN NOW button, you can test how the export behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, or otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to execute exports on different tables, datasets and buckets for each execution, but essentially employing the same Cloud Function, you can use the HTTP POST method instead, and configure a Body containing said parameters as data, which would be passed on to the Cloud Function - although that would imply making some small changes to its code.
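For instance, here is a minimal sketch of how the earlier function could read those parameters from a POST body instead of hard-coding them (the JSON field names are purely illustrative and must match whatever the Scheduler job sends):

from google.cloud import bigquery

def hello_world(request):
    # Read export parameters from the POST body sent by Cloud Scheduler.
    params = request.get_json(silent=True) or {}
    project_name = params.get("project")
    dataset_name = params.get("dataset")
    table_name = params.get("table")
    bucket_name = params.get("bucket")

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    table_to_export = bq_client.dataset(dataset_name, project=project_name).table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        location="US",  # must match the source table's location
        job_config=job_config,
    )
    return "Job with ID {} started exporting {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)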
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
OPTIONS (
  uri = 'gs://bucket/folder/*.csv',
  format = 'CSV',
  overwrite = true,
  header = true,
  field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);
This can also be trivially set up via a scheduled query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permission to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
You have an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. That way, the Cloud Scheduler setup described by Maxim is optional and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status in the Pub/Sub notification. You also get a lot of information about the scheduled query, which is useful if you want to perform more checks or if you want to generalize the function.
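As a rough sketch, a Pub/Sub-triggered function could perform that check before launching the export. This assumes the notification carries the transfer run as JSON; the 'state' and 'errorStatus' field names come from the TransferRun resource and are worth verifying against your own messages.

import base64
import json

def on_scheduled_query_done(event, context):
    # Background Pub/Sub functions receive the message data base64-encoded.
    run = json.loads(base64.b64decode(event['data']).decode('utf-8'))

    # Only export when the scheduled query succeeded; 'state' and
    # 'errorStatus' are assumed TransferRun fields.
    if run.get('state') != 'SUCCEEDED':
        print('Scheduled query did not succeed: {}'.format(run.get('errorStatus')))
        return

    # ...trigger the BigQuery -> Cloud Storage extract job here,
    # e.g. with the extract_table() code shown earlier.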
One more point about the SFTP transfer. I open-sourced a project for querying BigQuery, building a CSV file and transferring this file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.
I am using Airflow to run Spark jobs on Google Cloud Composer. I need to:
Create a cluster (YAML parameters supplied by the user)
Run a list of Spark jobs (job params also supplied by a per-job YAML)
With the Airflow API I can read YAML files and push variables across tasks using XCom.
But consider DataprocClusterCreateOperator(): only
cluster_name
project_id
zone
and a few other arguments are marked as templated.
What if I want to pass in other arguments as templated (which are currently not), like image_version, num_workers, worker_machine_type, etc.?
Is there any workaround for this?
Not sure what you mean by 'dynamic', but when the YAML file is updated, if the file-reading code is in the DAG file body, the DAG will be refreshed to pick up the new arguments from the YAML file. So actually, you don't need XCom to get the arguments.
Simply create a params dictionary and then pass it to default_args:
import os

import yaml
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

CONFIGFILE = os.path.join(
    os.path.dirname(os.path.realpath(__file__)), 'your_yaml_file')

with open(CONFIGFILE, 'r') as ymlfile:
    CFG = yaml.safe_load(ymlfile)

default_args = {
    # edit here according to the structure of your yaml file
    'cluster_name': CFG['section_A']['cluster_name'],
    'project_id': CFG['section_A']['project_id'],
    'zone': CFG['section_A']['zone'],
    'image_version': CFG['section_A']['image_version'],
    'num_workers': CFG['section_A']['num_workers'],
    'worker_machine_type': CFG['section_A']['worker_machine_type'],
    # you can add all needed params here
}

DAG = DAG(
    dag_id=DAG_NAME,
    schedule_interval=SCHEDULE_INTERVAL,
    default_args=default_args,  # pass the params to the DAG environment
)

Task1 = DataprocClusterCreateOperator(
    task_id='your_task_id',
    dag=DAG
)
But if you want dynamic DAGs rather than dynamic arguments, you may need another strategy like this.
So you probably need to figure out the basic idea first:
At which level is the dynamism needed? Task level? DAG level?
Or you can create your own Operator to do the job and take the parameters.
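For instance, here is a rough sketch of that approach for the original question: subclass the operator and extend template_fields so extra constructor arguments are also rendered with Jinja. This assumes those arguments are stored under attributes of the same name, and uses the Airflow 1.10-era contrib import path; both are worth double-checking for your Airflow version.

from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

class TemplatedDataprocClusterCreateOperator(DataprocClusterCreateOperator):
    # Add extra constructor arguments to the set of fields that Airflow
    # renders with Jinja before execution.
    template_fields = tuple(DataprocClusterCreateOperator.template_fields) + (
        'image_version',
        'worker_machine_type',
    )

The subclass is then used exactly like the original operator, but templates (for example XCom pulls) can be passed for those extra arguments. Keep in mind that in older Airflow versions templated values render as strings, so numeric arguments such as num_workers would still need explicit conversion.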