I want to establish a pipeline connection between the components by passing some kind of data, just so the pipeline is organized like a flowchart with arrows. Right now it is like below.
Irrespective of whether the Docker container generates output or not, I want to pass some sample data between the components. However, if any changes are required in the Docker container code or the .yaml files, please let me know.
KFP Code
import os
from pathlib import Path
import requests
import kfp
#Load the component
component1 = kfp.components.load_component_from_file('comp_typed.yaml')
component2 = kfp.components.load_component_from_file('component2.yaml')
component3 = kfp.components.load_component_from_file('component3.yaml')
component4 = kfp.components.load_component_from_file('component4.yaml')
#Use the component as part of the pipeline
@kfp.dsl.pipeline(name='Document Processing Pipeline', description='Document Processing Pipeline')
def data_passing():
    task1 = component1()
    task2 = component2(task1.output)
    task3 = component3(task2.output)
    task4 = component4(task3.output)
comp_typed.yaml code
name: DPC
description: This is an example
implementation:
  container:
    image: gcr.io/pro1-in-us/dpc_comp1@sha256:3768383b9cd694936ef00464cb1bdc7f48bc4e9bbf08bde50ac7346f25be15de
    command: [python3, /dpc_comp1.py]
component2.yaml
name: Custom_Plugin_1
description: This is an example
implementation:
  container:
    image: gcr.io/pro1-in-us/plugin1@sha256:16cb4aa9edf59bdf138177d41d46fcb493f84ce798781125dc7777ff5e1602e3
    command: [python3, /plugin1.py]
I tried this and this but could not achieve anything except errors. I am new to Python and Kubeflow. What code changes should I make to pass data between all 4 components using the KFP SDK? The data can be a file or a string.
Suppose component 1 downloads a .pdf file from a GCS bucket: can I feed the same file into the next downstream component? Component 1 downloads the file to the '/tmp/doc_pages' location of the component 1 Docker container, which I believe is local to that particular container, so the downstream components cannot read it?
This notebook, which describes how to pass data between KFP components, may be useful. It covers the concept of 'small data', which you pass directly, versus 'large data', which you write to a file; as shown in the example notebook, the paths for the input and output files are chosen by the system and are passed into the function (as strings).
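For illustration, here is a minimal sketch of the 'large data' pattern, assuming the KFP v1 SDK; the function name and the file transformation are placeholders, not taken from the notebook:

from kfp.components import InputPath, OutputPath, create_component_from_func

def process_doc(input_path: InputPath(str), output_path: OutputPath(str)):
    # KFP chooses both file paths and passes them in as plain strings.
    with open(input_path) as f_in, open(output_path, 'w') as f_out:
        f_out.write(f_in.read().upper())

process_doc_op = create_component_from_func(process_doc)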
If you don't need to pass data between steps, but want to specify a step ordering dependency (e.g. op2 doesn't run until op1 is finished), you can indicate this in your pipeline definition like so:
op2.after(op1)
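For example, a pipeline that only needs ordering and passes no data could look like this sketch (assuming component1 and component2 are loaded components as in the question):

@kfp.dsl.pipeline(name='Ordering only')
def ordering_pipeline():
    op1 = component1()
    op2 = component2()
    op2.after(op1)  # op2 starts only after op1 has finished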
In addition to Amy's excellent answer:
Your pipeline is correct. The best way to establish a dependency between components is to establish data dependency.
Let's look at your pipeline code:
task2 = component2(task1.output)
You're passing the output of task1 to component2. This should result in the dependency that you want, but there are a couple of problems (and your pipeline will show compilation errors if you try to compile it):
component1 needs to have an output
component2 needs to have an input
component2 needs to have an output (so that you can pass it to component3)
Etc.
Let's add them:
name: DPC
description: This is an example
outputs:
- name: output_1
implementation:
  container:
    image: gcr.io/pro1-in-us/dpc_comp1@sha256:3768383b9cd694936ef00464cb1bdc7f48bc4e9bbf08bde50ac7346f25be15de
    command: [python3, /dpc_comp1.py, --output-1-path, {outputPath: output_1}]
name: Custom_Plugin_1
description: This is an example
inputs:
- name: input_1
outputs:
- name: output_1
implementation:
  container:
    image: gcr.io/pro1-in-us/plugin1@sha256:16cb4aa9edf59bdf138177d41d46fcb493f84ce798781125dc7777ff5e1602e3
    command: [python3, /plugin1.py, --input-1-path, {inputPath: input_1}, --output-1-path, {outputPath: output_1}]
With these changes, your pipeline should compile and display the dependencies that you want.
Please check the tutorial about creating components from command-line programs.
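For illustration, here is a hedged sketch of how /dpc_comp1.py could handle the new --output-1-path argument; the data written is a placeholder:

# dpc_comp1.py (sketch): KFP replaces {outputPath: output_1} with a local path
# where it expects this program to write the output artifact.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--output-1-path', dest='output_1_path', required=True)
args = parser.parse_args()

# The parent directory may not exist yet, so create it before writing.
os.makedirs(os.path.dirname(args.output_1_path), exist_ok=True)
with open(args.output_1_path, 'w') as f:
    f.write('sample data from component 1')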
If you don't want to create a dependency through outputs or pass any data between components, you can refer to the PVC from a previous step to explicitly call out a dependency.
Example:
You can create a PVC for storing data.
vop = dsl.VolumeOp(name="pvc",
                   resource_name="pvc",
                   size=<size>,
                   modes=dsl.VOLUME_MODE_RWO)
Use it in a component:
download = dsl.ContainerOp(name="download",
                           image="",
                           command=[" "],
                           arguments=[" "],
                           pvolumes={"/data": vop.volume})
Now you can call out dependency between download and train as follows:
train = dsl.ContainerOp(name="train",
                        image="",
                        command=[" "],
                        arguments=[" "],
                        pvolumes={"/data": download.pvolumes["/data"]})
Idea:
To use experiments and trials to log the training parameters and artifacts in SageMaker while using MWAA as the pipeline orchestrator.
I am using training_config to create the dict that passes the training configuration to the TensorFlow estimator, but there is no parameter to pass the experiment configuration:
tf_estimator = TensorFlow(entry_point='train_model.py',
                          source_dir=source,
                          role=sagemaker.get_execution_role(),
                          instance_count=1,
                          framework_version='2.3.0',
                          instance_type=instance_type,
                          py_version='py37',
                          script_mode=True,
                          enable_sagemaker_metrics=True,
                          metric_definitions=metric_definitions,
                          output_path=output)

model_training_config = training_config(
    estimator=tf_estimator,
    inputs=input,
    job_name=training_jobname,
)

training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=model_training_config,
    aws_conn_id="airflow-sagemaker",
    print_log=True,
    wait_for_completion=True,
    check_interval=60
)
You can use experiment_config in estimator.fit. A more detailed example can be found here.
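For example (a sketch; the experiment and trial names below are placeholders that must already exist in SageMaker):

tf_estimator.fit(
    inputs=input,
    job_name=training_jobname,
    experiment_config={
        "ExperimentName": "my-experiment",           # placeholder
        "TrialName": "my-trial",                     # placeholder
        "TrialComponentDisplayName": "train-model",  # display name in Studio
    },
)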
The only way that I found right now is to use the CreateTrainingJob API (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-RoleArn). A few caveats:
I am not sure if this will work with the bring-your-own-script method, e.g. with a TensorFlow estimator.
It works with a build-your-own-container approach.
Using the CreateTrainingJob API, I created the configs, which in turn include all the needed configs (training, experiment, algorithm, etc.), and passed that to SageMakerTrainingOperator, as sketched below.
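A hedged sketch of that approach: since the operator's config mirrors the CreateTrainingJob request, an ExperimentConfig block can be added to the dict before handing it to SageMakerTrainingOperator (the experiment and trial names are placeholders):

model_training_config = training_config(
    estimator=tf_estimator,
    inputs=input,
    job_name=training_jobname,
)
# 'ExperimentConfig' mirrors the field of the same name in the
# CreateTrainingJob API request; these names are placeholders.
model_training_config["ExperimentConfig"] = {
    "ExperimentName": "my-experiment",
    "TrialName": "my-trial",
    "TrialComponentDisplayName": "train-model",
}

training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=model_training_config,
    aws_conn_id="airflow-sagemaker",
)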
We are trying to return some metrics from our Vertex Pipeline, such that they are visible in the Run Comparison and Metadata tools in the Vertex UI.
I saw here that we can use this output type Output[Metrics], and the subsequent metrics.log_metric("metric_name", metric_val) method to add the metrics, and it seemed from the available documentation that this would be enough.
We want to use the reusable component method, as opposed to the Python function-based components on which the example is based. So we implemented it within our component code like so:
We added the output in the component.yaml:
outputs:
- name: metrics
  type: Metrics
  description: evaluation metrics path
then added the output to the command in the implementation:
command: [
python3, main.py,
--gcs-test-data-path, {inputValue: gcs_test_data_path},
--gcs-model-path, {inputValue: gcs_model_path},
--gcs-output-bucket-id, {inputValue: gcs_output_bucket_id},
--project-id, {inputValue: project_id},
--timestamp, {inputValue: timestamp},
--batch-size, {inputValue: batch_size},
--img-height, {inputValue: img_height},
--img-width, {inputValue: img_width},
--img-depth, {inputValue: img_depth},
--metrics, {outputPath: metrics},
]
Next, in the component's main Python script, we parse this argument with argparse:
PARSER.add_argument('--metrics',
type=Metrics,
required=False,
help='evaluation metrics output')
and pass it to the component's main function:
if __name__ == '__main__':
    ARGS = PARSER.parse_args()
    evaluation(gcs_test_data_path=ARGS.gcs_test_data_path,
               gcs_model_path=ARGS.gcs_model_path,
               gcs_output_bucket_id=ARGS.gcs_output_bucket_id,
               project_id=ARGS.project_id,
               timestamp=ARGS.timestamp,
               batch_size=ARGS.batch_size,
               img_height=ARGS.img_height,
               img_width=ARGS.img_width,
               img_depth=ARGS.img_depth,
               metrics=ARGS.metrics,
               )
In the declaration of the component function, we then typed this metrics parameter as Output[Metrics]:
from kfp.v2.dsl import Output, Metrics

def evaluation(gcs_test_data_path: str,
               gcs_model_path: str,
               gcs_output_bucket_id: str,
               metrics: Output[Metrics],
               project_id: str,
               timestamp: str,
               batch_size: int,
               img_height: int,
               img_width: int,
               img_depth: int):
Finally, we call the log_metric method within this evaluation function:
metrics.log_metric('accuracy', acc)
metrics.log_metric('precision', prec)
metrics.log_metric('recall', recall)
metrics.log_metric('f1-score', f_1)
When we run this pipeline, we can see the metric artifact materialised in the DAG, and Metrics artifacts are listed in the Metadata UI in Vertex. However, clicking through to view the artifact's JSON, there is no metadata listed, and no metadata is visible when comparing runs in the pipeline UI. Finally, navigating to the object's URI in GCS, we are met with 'Requested entity was not found.', which I assume indicates that nothing was written to GCS.
Are we doing something wrong with this implementation of metrics in the reusable components? From what I can tell, this all seems right to me, but it's hard to tell, given that the docs at this point seem to focus primarily on examples with Python function-based components.
Do we perhaps need to proactively write this Metrics object to an OutputPath?
Any help is appreciated.
----- UPDATE -----
I have since been able to get the artifact metadata and URI to update. In the end we used the KFP SDK to generate a YAML file based on a @component-decorated Python function, then adapted this format for our reusable components.
Our component.yaml now looks like this:
name: predict
description: Prepare and create predictions request
implementation:
  container:
    args:
    - --executor_input
    - executorInput: null
    - --function_to_execute
    - predict
    command:
    - python3
    - -m
    - kfp.v2.components.executor_main
    - --component_module_path
    - predict.py
    image: gcr.io/PROJECT_ID/kfp/components/predict:latest
inputs:
- name: input_1
  type: String
- name: input_2
  type: String
outputs:
- name: output_1
  type: Dataset
- name: output_2
  type: Dataset
With this change to the YAML, we can now successfully update the artifact's metadata dictionary and its URI through artifact.path = '/path/to/file'. These updates are displayed in the Vertex UI.
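For reference, a hedged sketch of the predict.py module that this executor-based YAML expects; the function body is a placeholder:

# predict.py (sketch): kfp.v2.components.executor_main imports this module and
# invokes the function named by --function_to_execute.
from kfp.v2.dsl import Dataset, Output

def predict(input_1: str,
            input_2: str,
            output_1: Output[Dataset],
            output_2: Output[Dataset]):
    # Updating .metadata and writing to .path is what surfaces in the Vertex UI.
    output_1.metadata['note'] = 'placeholder'
    with open(output_1.path, 'w') as f:
        f.write('placeholder predictions')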
I am still unsure why the component.yaml format specified in the Kubeflow documentation does not work - I think this may be a bug with Vertex Pipelines.
From what I can see in the code you are running, everything should work without a problem; but, as you commented, I would recommend writing the metrics object to a path so that it ends up somewhere within your project.
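For instance, a minimal sketch of doing that inside the component's main.py, treating the --metrics argument as a plain output file path (the metric values are placeholders):

import argparse
import json
import os

parser = argparse.ArgumentParser()
parser.add_argument('--metrics', dest='metrics_path', required=True)
args = parser.parse_args()

metrics_dict = {'accuracy': 0.95, 'f1-score': 0.9}  # placeholder values

# Write the metrics JSON to the system-provided output path ourselves.
os.makedirs(os.path.dirname(args.metrics_path), exist_ok=True)
with open(args.metrics_path, 'w') as f:
    json.dump(metrics_dict, f)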
I am attempting to add a new custom RSU module (extending AdHocHost) into the Veins_Inet example. Here is my updated scenario (with 1 RSU).
network TestScenario {
    submodules:
        radioMedium: Ieee80211ScalarRadioMedium;
        manager: VeinsInetManager;
        node[0]: VeinsInetCar;
        // added rsu
        rsu: VeinsInetRSU;
    connections allowunconnected:
}
I also updated the ini file so that the RSU mobility is
*.rsu.mobility.typename = "inet.mobility.static.StationaryMobility"
and the RSU application is barebones with minor implementation:
*.rsu.app[0].typename = "practice.veins_inet.VeinsInetRSUSampleApplication".
However, I get the following error:
TraCIMobility::getExternalId called with no external id set yet.
In the example, the VeinsInetManager is managing the cars with TraCI. Here is the NED file associated with the manager. The source file has 2 functions: pre-initialize module and update module position.
simple VeinsInetManager extends TraCIScenarioManagerLaunchd {
    parameters:
        @class(veins::VeinsInetManager);
}
How can I add a custom module into the scenario without raising any errors?
Your application might be inheriting from VeinsInetApplicationBase, which calls TraCI methods (that fail for nodes that are not a TraCI-managed vehicle). See also its source code.
To be doubly sure, run your simulation in debug mode, turn on debug-on-errors, and check the stack trace to see where the call is coming from.
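For reference, debug-on-errors is a standard OMNeT++ option that can be enabled in omnetpp.ini:

[General]
debug-on-errors = true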
I have a Google App Engine Standard Environment application that has been working fine for a year or more, that, quite suddenly, refuses to enqueue new deferred tasks using deferred.defer.
Here's the Python 2.7 code that is making the deferred call:
# Find any inventory items that reference the product, and change them too.
# because this could take some time, we'll do it as a deferred task, and only
# if needed.
if upd:
    updater = deferredtasks.InvUpdate()
    deferred.defer(updater.run, product_key)
My app.yaml file has the necessary bits to support deferred.defer:
- url: /_ah/queue/deferred
  script: google.appengine.ext.deferred.deferred.application
  login: admin

builtins:
- deferred: on
And my deferred task has logging in it so I should see it running when it does:
#-------------------------------------------------------------------------------
# DEFERRED routine that updates the inventory items for a particular product. Should be called
# when ANY changes are made to the product, because it should trigger a re-download of the
# inventory record for that product to the iPad.
#-------------------------------------------------------------------------------
class InvUpdate(object):
    def __init__(self):
        self.to_put = []
        self.product_key = None
        self.updcount = 0

    def run(self, product_key, batch_size=100):
        updproduct = product_key.get()
        if not updproduct:
            logging.error("DEFERRED ERROR: Product key passed in does not exist")
            return
        logging.info(u"DEFERRED BEGIN: beginning inventory update for: {}".format(updproduct.name))
        self.product_key = product_key
        self._continue(None, batch_size)
        ...
When I run this in the development environment on my development box, everything works fine. Once I deploy it to the App Engine server, the inventory updates never get done (i.e. the deferred task is not executed), and there are no errors (and in fact no other logging from the deferred task) in the log files on the server. I know that with the sudden push to get everybody onto Python 3 as quickly as possible, the deferred.defer library has been marked as not recommended because it only works with the Python 2.7 environment, and I planned on moving to task queues for this, but I wasn't expecting deferred.defer to suddenly stop working in the existing Python environment.
Any insight would be greatly appreciated!
I'm pretty sure you can't pass the method of an instance to the App Engine task queue, because that instance will no longer exist when your task runs, since the task runs in a different process. I actually don't understand how your task ever worked when running remotely in the first place (and running locally is not an accurate representation of how things will run remotely).
Try changing your code to this:
if upd:
    deferred.defer(deferredtasks.InvUpdate.run_cls, product_key)
and then InvUpdate is the same but has a new function run_cls:
class InvUpdate(object):
    @classmethod
    def run_cls(cls, product_key):
        cls().run(product_key)
I'm still in the process of migrating to Cloud Tasks, and my deferred tasks still work.
I ran into an apparent circular dependency trying to use log data for TensorBoard during a hyper-parameter search done with Keras Tuner, for a model built with TF2. The typical setup for the latter needs to set up the TensorBoard callback in the tuner's search() method, which wraps the model's fit() method.
from kerastuner.tuners import RandomSearch

tuner = RandomSearch(build_model,  # this method builds the model
                     hyperparameters=hp,
                     objective='val_accuracy')
tuner.search(x=train_x, y=train_y,
             validation_data=(val_x, val_y),
             callbacks=[tensorboard_cb])
In practice, the tensorboard_cb callback needs to set up the directory where data will be logged, and this directory has to be unique to each trial. A common way to do this is to name the directory based on the current timestamp, with code like below.
import time
from tensorflow.keras.callbacks import TensorBoard

log_dir = time.strftime('trial_%Y_%m_%d-%H_%M_%S')
tensorboard_cb = TensorBoard(log_dir)
This works when training a model with known hyper-parameters. However, when doing a hyper-parameter search, I have to define and specify the TensorBoard callback before invoking tuner.search(). This is the problem: tuner.search() will invoke build_model() multiple times, and each of these trials should have its own TensorBoard directory. Ideally, defining log_dir would be done inside build_model(), but the Keras Tuner search API forces the TensorBoard callback to be defined outside of that function.
TL;DR: TensorBoard gets data through a callback and requires one log directory per trial, but Keras Tuner requires defining the callback once for the entire search, before performing it, not per trial. How can unique directories per trial be defined in this case?
Keras Tuner creates a subdirectory for each run (this statement is probably version dependent), so I guess finding the right version mix is important.
Here is how it works for me, in JupyterLab.
Prerequisites:
1. pip requirements:
keras-tuner==1.0.1
tensorboard==2.1.1
tensorflow==2.1.0
Keras==2.2.4
jupyterlab==1.1.4
2. JupyterLab installed, built and running [standard compile arguments: production:minimize]
Here is the actual code. First I define the log folder and the callbacks:
import datetime
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

# run parameter
log_dir = "logs/" + datetime.datetime.now().strftime("%m%d-%H%M")

# training meta
stop_callback = EarlyStopping(
    monitor='loss', patience=1, verbose=0, mode='auto')

hist_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    embeddings_freq=1,
    write_graph=True,
    update_freq='batch')

print("log_dir", log_dir)
Then I define my hypermodel, which I do not want to disclose. Afterwards, I set up the hyper-parameter search:
from kerastuner.tuners import Hyperband

hypermodel = get_my_hypermodel()
tuner = Hyperband(
    hypermodel,
    max_epochs=40,
    objective='loss',
    executions_per_trial=5,
    directory=log_dir,
    project_name='test'
)
which I then execute:
tuner.search(
    train_data,
    labels,
    epochs=10,
    validation_data=(val_data, val_labels),
    callbacks=[hist_callback],
    use_multiprocessing=True)

tuner.search_space_summary()
While the notebook with this code searches for adequate hyper-parameters, I monitor the loss in another notebook. Since TF v2, TensorBoard can be called via a magic function:
Cell 1
import tensorboard
Cell 2
%load_ext tensorboard
Cell 3
%tensorboard --logdir 'logs/'
Side note: since I run JupyterLab in a Docker container, I have to specify the appropriate address and port for TensorBoard and also forward these in the Dockerfile.
The result is not really predictable for me... I have not yet understood when I can expect histograms and distributions in TensorBoard. For some runs the loading time seems really excessive, so have patience.
Under scalars, I find a list of the runs as follows:
"logdir"/"model_hash"/execution[iter]/[train/validation]
E.g.
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/train
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/validation