I am using Airflow to run Spark jobs on Google Cloud Composer. I need to:
- create a cluster (YAML parameters supplied by the user)
- run a list of Spark jobs (job parameters also supplied by a per-job YAML)
With the Airflow API I can read YAML files and push variables across tasks using XCom.
But consider the DataprocClusterCreateOperator():
cluster_name
project_id
zone
and a few other arguments are marked as templated.
What if I want to pass in other arguments as templated (which currently are not), like image_version, num_workers, worker_machine_type, etc.?
Is there any workaround for this?
Not sure what you mean by 'dynamic', but when the YAML file is updated, if the file is read in the DAG file body, the DAG will be refreshed to pick up the new arguments from the YAML file. So you don't actually need XCom to pass the arguments.
Just create a params dictionary and pass it to default_args:
CONFIGFILE = os.path.join(
    os.path.dirname(os.path.realpath(__file__)), 'your_yaml_file')

with open(CONFIGFILE, 'r') as ymlfile:
    CFG = yaml.safe_load(ymlfile)  # safe_load avoids executing arbitrary YAML tags

default_args = {
    'cluster_name': CFG['section_A']['cluster_name'],  # edit here according to the structure of your yaml file
    'project_id': CFG['section_A']['project_id'],
    'zone': CFG['section_A']['zone'],
    'image_version': CFG['section_A']['image_version'],
    'num_workers': CFG['section_A']['num_workers'],
    'worker_machine_type': CFG['section_A']['worker_machine_type'],
    # you can add all the params you need here
}

dag = DAG(
    dag_id=DAG_NAME,
    schedule_interval=SCHEDULE_INTERVAL,
    default_args=default_args,  # pass the params to the DAG environment
)

Task1 = DataprocClusterCreateOperator(
    task_id='your_task_id',
    dag=dag,
)
But if you want dynamic DAGs rather than dynamic arguments, you may need another strategy like this.
So you probably need to figure out the basic idea first: at which level is the dynamism? Task level? DAG level?
Or you can create your own operator to do the job and take the parameters.
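On that last point, and on the original question about templating arguments the operator doesn't template: a common workaround is to subclass the operator and extend its template_fields, since Airflow renders Jinja on exactly the attributes listed there. A minimal sketch of the pattern (a stand-in base class is used here so the snippet runs without Airflow installed; in a real DAG you would subclass DataprocClusterCreateOperator itself, and the attribute names must match what the operator actually stores on the instance, so verify against your Airflow version's source):

```python
# Stand-in for DataprocClusterCreateOperator, just to illustrate the pattern.
class BaseDataprocOperator:
    template_fields = ('cluster_name', 'project_id', 'zone')

class TemplatedDataprocOperator(BaseDataprocOperator):
    # Airflow applies Jinja templating to every attribute named here,
    # so extending the parent's tuple makes extra arguments templated.
    template_fields = BaseDataprocOperator.template_fields + (
        'image_version',
        'num_workers',
        'worker_machine_type',
    )

print(TemplatedDataprocOperator.template_fields)
```

Use the subclass in place of the stock operator and pass templated strings (e.g. from Variables or dag_run.conf) to those arguments.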
I have created a Docker image whose entrypoint is processing.py. This script takes data from /opt/ml/processing/input and, after processing, puts it in the /opt/ml/processing/output folder.
To process the data I need to place the file into /opt/ml/processing/input from S3, and then pick up the processed file from /opt/ml/processing/output back into S3.
The following script in SageMaker does this properly:
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
import sagemaker
input_data = 's3://sagemaker-ap-south-1-057036842446/sagemaker/Data/Training/Churn_Modelling.csv'
output_dir = 's3://sagemaker-ap-south-1-057036842446/sagemaker/Outputs/'
image_uri = '057036842446.dkr.ecr.ap-south-1.amazonaws.com/aws-docker-repo:latest'
aws_role = sagemaker.get_execution_role()
processor = Processor(image_uri=image_uri, role=aws_role, instance_count=1, instance_type='ml.m5.xlarge')
processor.run(
    inputs=[
        ProcessingInput(
            source=input_data,
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/output',
            destination=output_dir
        )
    ]
)
Could someone please guide me on how this can be executed with a Lambda function? Lambda does not recognize the sagemaker package, and there is also the challenge of placing the file before the script executes and picking up the processed files afterwards.
I tried CodePipeline to automate this operation, but had no success with it.
I am not sure how to put the file from S3 into the folders used internally by the script, i.e. how the processing step picks up data from /opt/ml/processing/input.
If you want to kick off a Processing Job from Lambda you can use boto3 to make the CreateProcessingJob API call:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_processing_job
I would suggest creating the Job as you have been doing using the SageMaker SDK. Once created, you can describe the Job using the DescribeProcessingJob API call:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_processing_job
You can then use the information from the DescribeProcessingJob API call output to fill out the CreateProcessingJob in Lambda.
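A sketch of what that Lambda could look like. The request body mirrors the fields DescribeProcessingJob returns; the job name, role ARN and volume size below are placeholders you would copy from your own describe output, while the image URI and S3 URIs are taken from the question:

```python
def build_request():
    """Assemble the CreateProcessingJob request body.
    The job name, RoleArn and VolumeSizeInGB are placeholders - copy the
    real values from the output of a DescribeProcessingJob call on the
    job you created with the SageMaker SDK."""
    return {
        'ProcessingJobName': 'churn-processing-job',
        'AppSpecification': {
            'ImageUri': '057036842446.dkr.ecr.ap-south-1.amazonaws.com/aws-docker-repo:latest',
        },
        'RoleArn': 'arn:aws:iam::057036842446:role/YOUR_SAGEMAKER_ROLE',
        'ProcessingResources': {
            'ClusterConfig': {
                'InstanceCount': 1,
                'InstanceType': 'ml.m5.xlarge',
                'VolumeSizeInGB': 30,
            },
        },
        'ProcessingInputs': [{
            'InputName': 'input',
            'S3Input': {
                'S3Uri': 's3://sagemaker-ap-south-1-057036842446/sagemaker/Data/Training/Churn_Modelling.csv',
                'LocalPath': '/opt/ml/processing/input',   # S3 object is copied here before the container starts
                'S3DataType': 'S3Prefix',
                'S3InputMode': 'File',
            },
        }],
        'ProcessingOutputConfig': {
            'Outputs': [{
                'OutputName': 'output',
                'S3Output': {
                    'S3Uri': 's3://sagemaker-ap-south-1-057036842446/sagemaker/Outputs/',
                    'LocalPath': '/opt/ml/processing/output',  # uploaded to S3 when the job ends
                    'S3UploadMode': 'EndOfJob',
                },
            }],
        },
    }

def handler(event, context):
    # boto3 is imported inside the handler so the request builder above
    # stays importable without the AWS SDK present.
    import boto3
    sm = boto3.client('sagemaker')
    sm.create_processing_job(**build_request())
    return {'status': 'started'}
```

Note that boto3 is bundled in the Lambda Python runtime, so unlike the sagemaker package it needs no layer, and the S3-to-container staging is handled by the Processing service itself via the LocalPath mappings, not by your code.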
I have a requirement wherein I need to pass parameters to an Airflow DAG (e.g. tenantName), and the DAG (PythonOperator) will use the tenantName to filter data based on the parameter passed.
I'm able to access the parameters stored in Airflow using the following code:
def greeting():
    import logging
    logging.info("Hello there, World! {} {}".format(models.Variable.get('tenantName'), datetime.datetime.now()))
However, for my requirement the tenantName will differ depending on the run.
How do I achieve this?
Have ~50k compressed (gzip) JSON files daily that need to be uploaded to BQ with some transformation, no API calls. The files may be up to 1 GB each.
What is the most cost-efficient way to do it?
Will appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function triggered on every new file upload to uncompress the file.
Create a Data Fusion job with the GCS file as source and BigQuery as sink, with the desired transformation.
Refer to my YouTube video below:
https://youtu.be/89of33RcaRw
Here is (for example) one way - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
...but quickly looking it over, one can see that there are some specific limitations. So perhaps simplicity, customization and maintainability of the solution should also be added to your "cost" function.
Not knowing some details (for example, read the "Limitations" section under my link above; what stack you have / are willing / able to use; file names; whether your files have nested fields; etc.), my first thought is the Cloud Functions service (https://cloud.google.com/functions/pricing) "listening" (event type = Finalize/Create) to the Cloud Storage bucket where your files land (if you go this route, put your storage and function in the same zone if possible, which will make it cheaper).
If you can code Python, here is some starter code:
main.py
import io

import pandas as pd
from google.cloud import storage

def process(event, context):
    file = event
    # check if it's your file; you can also check for patterns in the name
    if file['name'] == 'YOUR_FILENAME':
        try:
            working_file = file['name']
            storage_client = storage.Client()
            bucket = storage_client.get_bucket('your_bucket_here')
            blob = bucket.blob(working_file)
            # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
            zipbytes = io.BytesIO(blob.download_as_string())
            # print for logging
            print(f"file downloaded, {working_file}")
            # read file as df --- check out the docs: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
            # if nested, you might need to go text --> dictionary and then do some preprocessing
            df = pd.read_json(zipbytes, compression='gzip')
            # write processed data to BigQuery (uses pandas-gbq under the hood)
            df.to_gbq(destination_table='your_dataset.your_table',
                      project_id='your_project_id',
                      if_exists='append')
            print(f"table bq created, {working_file}")
            # to delete the processed file from storage and save on storage costs, uncomment the 2 lines below
            # blob.delete()
            # print(f"blob deleted, {working_file}")
        except Exception as e:
            print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas-gbq
PS
Some alternatives include:
1. Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example Cloud Scheduler --> Pub/Sub --> Cloud Function --> which starts up your VM --> which then runs your script)
2. Using App Engine to run your script (similar)
3. Using Cloud Run to run your script (similar)
4. Using Composer/Airflow (not similar to 1, 2 & 3) [could use all types of approaches including data transfers etc.; just not sure what stack you are trying to use or what you already have running]
5. Scheduling a Vertex AI workbook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI)
6. Querying the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again, not sure about your overall pipeline)
The setup for everything except #5 & #6 is just technical debt that, to me, is not worth it if you can get away with functions.
Best of luck,
I have created a Vertex AI pipeline similar to this.
Now the pipeline has a reference to a CSV file, so if this CSV file changes, the pipeline needs to be recreated.
Is there any way to pass a new CSV as a parameter to the pipeline when it is re-run, that is, without recreating the pipeline using the notebook?
If not, is there a best-practice way of auto-updating the dataset, model and deployment?
Have a look at that documentation.
You can define your pipeline like this:
...
# Define the workflow of the pipeline.
@kfp.dsl.pipeline(
    name="automl-image-training-v2",
    pipeline_root=pipeline_root_path)
def pipeline(project_id: str):
    ...
(you have something very similar in your notebook sample)
Then, when you invoke your pipeline, you can pass some parameter
import google.cloud.aiplatform as aip

job = aip.PipelineJob(
    display_name="automl-image-training-v2",
    template_path="image_classif_pipeline.json",
    pipeline_root=pipeline_root_path,
    parameter_values={
        'project_id': project_id
    }
)

job.submit()
You can see project_id as a dict key in parameter_values, and as a parameter of your pipeline function.
Do the same for your CSV file name!
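Applied to the CSV, that could look like the sketch below. The csv_gcs_path parameter name is illustrative; once the compiled pipeline declares it, each re-run only needs a new parameter_values dict, with no recompilation:

```python
# Pipeline definition side (sketch - matches the decorator shown above):
#
#   @kfp.dsl.pipeline(name="automl-image-training-v2", pipeline_root=pipeline_root_path)
#   def pipeline(project_id: str, csv_gcs_path: str):
#       ...  # use csv_gcs_path wherever the CSV URI was hard-coded

# Invocation side - only this dict changes between runs:
def make_parameter_values(project_id, csv_gcs_path):
    """Build the parameter_values dict passed to aip.PipelineJob."""
    return {
        'project_id': project_id,
        'csv_gcs_path': csv_gcs_path,
    }

params = make_parameter_values('my-project', 'gs://my-bucket/new_data.csv')
```

Then pass parameter_values=params to aip.PipelineJob exactly as in the snippet above and call job.submit(); a new CSV means a new params dict, not a new pipeline.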
I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfer into BigQuery or Cloud Storage, but I haven't found anything yet about scheduling an export from a BigQuery table to Cloud Storage.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export from the BigQuery table to Cloud Storage. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.
Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn get triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week, at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose GET as the HTTP method.
Once created, you can test how the export behaves by pressing the RUN NOW button. Before doing so, however, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, or otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to export different tables, datasets and buckets on each execution, while essentially employing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing those parameters as data, which would be passed on to the Cloud Function - although that would imply some small changes to its code.
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);
This could also be trivially set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
You have an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. Thereby, the Cloud Scheduler setup described by Maxim is optional, and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status of the Pub/Sub notification. You also get a lot of information about the scheduled query, which is useful if you want to perform more checks or generalize the function.
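A sketch of such a Pub/Sub-triggered function. The assumption here is that the notification carries the transfer-run details as base64-encoded JSON in the message data, with an errorStatus field set on failure; inspect a real message from your topic to confirm the exact field names:

```python
import base64
import json

def run_succeeded(event):
    """Decode a Pub/Sub message from the BigQuery scheduled-query
    notification and report whether the run finished without error.
    The payload layout is an assumption - verify against a real message."""
    payload = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    # An empty or absent errorStatus is taken to mean success here.
    return not payload.get('errorStatus')

def process(event, context):
    # Entry point for a Pub/Sub-triggered Cloud Function.
    if run_succeeded(event):
        pass  # start the extract job here, as in the hello_world code above
    else:
        print('scheduled query reported an error, skipping export')
```

With this wiring, the export runs immediately after each scheduled query completes, instead of at a fixed time that has to be guessed to fall after the query.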
Another point, about the SFTP transfer: I open-sourced a project for querying BigQuery, building a CSV file and transferring this file to an FTP server (sFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.