Is it possible to update existing Google Cloud Dataflow pipelines when using templates for deployment?

When deploying Google Cloud Dataflow pipelines as templates, is it possible to update a running pipeline using another version of the template?
Basically, I am looking for a combination of https://cloud.google.com/dataflow/pipelines/updating-a-pipeline with https://cloud.google.com/dataflow/docs/templates/overview

Updating an existing job through the template API is not supported yet (we are working on it).
In the meantime, you can probably make use of our public repo (basically the source code for these templates). You can build and launch the job from a shell to "update" the running job.
https://github.com/GoogleCloudPlatform/DataflowTemplates

It is now possible to update a streaming Dataflow job that was launched from a template.
Using the REST API, set the update parameter to true:
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://YOUR_BUCKET_NAME/templates/TemplateName
{
"jobName": "JOB_NAME",
"parameters": {
"topic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME",
"table": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME"
},
"environment": {
"tempLocation": "gs://YOUR_BUCKET_NAME/temp",
"zone": "us-central1-f"
},
"update": true
}
The update option is not available with gcloud dataflow jobs run.
https://cloud.google.com/dataflow/docs/guides/templates/running-templates#example-3:-updating-a-custom-template-streaming-job
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
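As a rough sketch of the request above, the following Python builds the launch URL and JSON body with update set to true. It only constructs the request and does not send it; the helper name build_update_request and the sample values are placeholders, not part of the Dataflow API, and authentication is left out.

```python
import json
from urllib.parse import quote, urlencode


def build_update_request(project_id, location, template_gcs_path,
                         job_name, parameters, temp_location):
    """Return (url, body) for a templates:launch call that updates a running job.

    Hypothetical helper: it mirrors the REST example above and sends nothing.
    """
    base = (
        "https://dataflow.googleapis.com/v1b3/"
        f"projects/{quote(project_id)}/locations/{quote(location)}/templates:launch"
    )
    # gcsPath is passed as a query parameter, as in the example above.
    url = f"{base}?{urlencode({'gcsPath': template_gcs_path})}"
    body = {
        "jobName": job_name,  # must match the name of the running job
        "parameters": parameters,
        "environment": {"tempLocation": temp_location},
        "update": True,
    }
    return url, body


# Placeholder values, for illustration only.
url, body = build_update_request(
    project_id="YOUR_PROJECT_ID",
    location="us-central1",
    template_gcs_path="gs://YOUR_BUCKET_NAME/templates/TemplateName",
    job_name="JOB_NAME",
    parameters={"topic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME"},
    temp_location="gs://YOUR_BUCKET_NAME/temp",
)
print(url)
print(json.dumps(body, indent=2))
```

The body would then be POSTed to the URL with an Authorization: Bearer header carrying a valid access token.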

Related

Can you start AI platform jobs from HTTP requests?

I have a web app (react + node.js) running on App Engine.
I would like to kick off (from this web app) a Machine Learning job that requires a GPU (running in a container on AI platform or running on GKE using a GPU node pool like in this tutorial, but we are open to other solutions).
I was thinking of trying what is described at the end of this answer, basically making an HTTP request to start the job using project.job.create API.
More details on the ML job in case this is useful: it generates an output every second that is stored on Cloud Storage and then read in the web app.
I am looking for examples of how to set this up? Where would the job configuration live and how should I set up the API call to kick off that job? Are the there other ways to achieve the same result?
Thank you in advance!

On Google Cloud, everything is an API, and you can interact with every product through HTTP requests, so you can definitely achieve what you want.
I don't personally have an example, but you have to build a JSON job description and POST it to the API.
Don't forget that when you call a Google Cloud API, you have to add an access token in the Authorization: Bearer header.
Where should your job configuration live? It depends...
If it is strongly tied to your App Engine app, you can put it in the App Engine code itself and have it "hard coded". The downside of that option is that any time you update the configuration, you have to deploy a new App Engine version. On the other hand, if the new version isn't correct, rolling back to a previous stable version is easy and consistent.
If you prefer to update your config file and your App Engine code separately, you can store the config outside the App Engine code, on Cloud Storage for instance. That way, the update is simple: change the config on Cloud Storage to change the job configuration. However, there is no longer any link between the App Engine version and the config version, and rolling back to a stable version can be more difficult.
You can also combine both: keep a default job configuration in your App Engine code, and optionally set an environment variable that points to a Cloud Storage file containing a newer version of the configuration.
I don't know if this answers all your questions. Don't hesitate to comment if you want more details on some parts.
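The combined option (a built-in default plus an optional override) might look roughly like this in Python. The environment variable name JOB_CONFIG_URI, the default values, and the fetch_text callable are all assumptions made for illustration; in a real app fetch_text would read the object from Cloud Storage.

```python
import json
import os

# Default job configuration shipped with the app code (illustrative values).
DEFAULT_JOB_CONFIG = {
    "jobId": "default-job",
    "trainingInput": {"region": "us-central1"},
}


def load_job_config(fetch_text, env=os.environ):
    """Return the override config if JOB_CONFIG_URI is set, else the default.

    fetch_text is a caller-supplied function (hypothetical) that takes a URI
    and returns the file content as a string.
    """
    uri = env.get("JOB_CONFIG_URI")  # hypothetical variable name
    if not uri:
        return dict(DEFAULT_JOB_CONFIG)
    return json.loads(fetch_text(uri))


def fake_fetch(uri):
    """Stand-in for a Cloud Storage read, used only to exercise the sketch."""
    return json.dumps({"jobId": "from-" + uri.rsplit("/", 1)[-1]})


print(load_job_config(fake_fetch, env={}))
print(load_job_config(fake_fetch, env={"JOB_CONFIG_URI": "gs://bucket/job.json"}))
```

Keeping the default in code preserves the easy rollback described above, while the override allows a configuration change without a redeploy.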

As mentioned, you can use the AI Platform API to create a job via a POST request.
The following example uses JavaScript and the request library to trigger a job.
Some useful tips:
Create a job manually in the Jobs console, then use the API to list that job; this gives you a perfect JSON example of how to trigger it.
You can use the Try this API tool to get the JSON output of the manually created job. Use this path to get the job: projects/<project name>/jobs/<job name>.
Get the authorization token using the OAuth 2.0 Playground for testing purposes (Step 2 -> Access token:). Check the docs for a definitive way.
Not all parameters in the JSON are required; this is just one example, taken from a job that I created and whose JSON I retrieved using the steps above.
JS Example:
var request = require('request');
request({
url: 'https://content-ml.googleapis.com/v1/projects/<project-name>/jobs?alt=json',
method: 'POST',
headers: {"authorization": "Bearer ya29.A0AR9999999999999999999999999"},
json: {
"jobId": "<job name>",
"trainingInput": {
"scaleTier": "CUSTOM",
"masterType": "standard",
"workerType": "cloud_tpu",
"workerCount": "1",
"args": [
"--training_data_path=gs://<bucket>/*.jpg",
"--validation_data_path=gs://<bucket>/*.jpg",
"--num_classes=2",
"--max_steps=2",
"--train_batch_size=64",
"--num_eval_images=10",
"--model_type=efficientnet-b0",
"--label_smoothing=0.1",
"--weight_decay=0.0001",
"--warmup_learning_rate=0.0001",
"--initial_learning_rate=0.0001",
"--learning_rate_decay_type=cosine",
"--optimizer_type=momentum",
"--optimizer_arguments=momentum=0.9"
],
"region": "us-central1",
"jobDir": "gs://<bucket>",
"masterConfig": {
"imageUri": "gcr.io/cloud-ml-algos/image_classification:latest"
}
},
"trainingOutput": {
"consumedMLUnits": 1.59,
"isBuiltInAlgorithmJob": true,
"builtInAlgorithmOutput": {
"framework": "TENSORFLOW",
"runtimeVersion": "1.15",
"pythonVersion": "3.7"
}
}
}
}, function(error, response, body){
console.log(body);
});
Result:
...
{
createTime: '2022-02-09T17:36:42Z',
state: 'QUEUED',
trainingOutput: {
isBuiltInAlgorithmJob: true,
builtInAlgorithmOutput: {
framework: 'TENSORFLOW',
runtimeVersion: '1.15',
pythonVersion: '3.7'
}
},
etag: '999999aaaac='

Thank you everyone for the input. This was useful to help me resolve my issue, but I wanted to also share the approach I ended up taking:
I started by making sure I could kick off my job manually.
I used this tutorial with a config.yaml file that looked like this:
workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-4
    acceleratorType: NVIDIA_TESLA_T4
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: <Replace this with your container image URI>
    args: ["--some=argument"]
Once I had a job that could be kicked off manually, I switched to using the Vertex AI Node.js API to start the job or cancel it. The API exists in other languages.
I know my original question was about HTTP requests, but having an API in the language was a lot easier for me, in particular because I didn't have to worry about authentication.
I hope that is useful; happy to provide more details if needed.
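For readers who still want the HTTP route, here is a minimal Python sketch of how the config.yaml above could map onto a JSON request body. It assumes the custom job resource takes a displayName and a jobSpec containing a list of workerPoolSpecs; the function name custom_job_body and the sample values are made up for illustration, and nothing is sent.

```python
import json


def custom_job_body(display_name, image_uri, args):
    """Build a request body mirroring the config.yaml above (assumed layout)."""
    worker_pool = {
        "machineSpec": {
            "machineType": "n1-standard-4",
            "acceleratorType": "NVIDIA_TESLA_T4",
            "acceleratorCount": 1,
        },
        "replicaCount": 1,
        "containerSpec": {"imageUri": image_uri, "args": list(args)},
    }
    # Assumption: worker pools are passed as a list under jobSpec.
    return {"displayName": display_name, "jobSpec": {"workerPoolSpecs": [worker_pool]}}


# Placeholder values, for illustration only.
body = custom_job_body("my-gpu-job", "<your container image URI>", ["--some=argument"])
print(json.dumps(body, indent=2))
```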

What are SageMaker pipelines actually?

SageMaker pipelines are rather unclear to me. I'm not experienced in the field of ML, but I'm working on figuring out the pipeline definitions.
I have a few questions:
Is sagemaker pipelines a stand-alone service/feature? Because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.
Is a sagemaker pipeline essentially codepipeline? How do these integrate, how do these differ?
There's also a Python SDK, how does this differ from the CDK and CloudFormation?
I can't seem to find any examples besides the Python SDK usage, how come?
The docs and workshops only seem to properly describe the Python SDK usage, so it would be really helpful if someone could clear this up for me!

SageMaker has two things called Pipelines: Model Building Pipelines and Serial Inference Pipelines. I believe you're referring to the former.
A model building pipeline defines steps in a machine learning workflow, such as pre-processing, hyperparameter tuning, batch transformations, and setting up endpoints.
A serial inference pipeline is two or more SageMaker models run one after the other.
A model building pipeline is defined in JSON, and is hosted/run in some sort of proprietary, serverless fashion by SageMaker.
Is sagemaker pipelines a stand-alone service/feature? Because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.
You can create/modify them using the API, which can also be called via the CLI, Python SDK, or CloudFormation. These all use the AWS API under the hood.
You can start/stop/view them in SageMaker Studio:
Left-side Navigation bar > SageMaker resources > Drop-down menu > Pipelines
Is a sagemaker pipeline essentially codepipeline? How do these integrate, how do these differ?
Unlikely. CodePipeline is a more general tool for building and deploying code and is not specific to SageMaker. There is no direct integration as far as I can tell, other than that you can start a SageMaker pipeline from CodePipeline.
There's also a Python SDK, how does this differ from the CDK and CloudFormation?
The Python SDK is a stand-alone library for interacting with SageMaker in a developer-friendly fashion. It's more dynamic than CloudFormation: it lets you build pipelines using code, whereas CloudFormation takes a static JSON string.
A very simple example of Python SageMaker SDK usage:
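from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

processor = SKLearnProcessor(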
framework_version="0.23-1",
instance_count=1,
instance_type="ml.m5.large",
role="role-arn",
)
processing_step = ProcessingStep(
name="processing",
processor=processor,
code="preprocessor.py"
)
pipeline = Pipeline(name="foo", steps=[processing_step])
pipeline.upsert(role_arn = ...)
pipeline.start()
pipeline.definition() produces rather verbose JSON like this:
{
"Version": "2020-12-01",
"Metadata": {},
"Parameters": [],
"PipelineExperimentConfig": {
"ExperimentName": {
"Get": "Execution.PipelineName"
},
"TrialName": {
"Get": "Execution.PipelineExecutionId"
}
},
"Steps": [
{
"Name": "processing",
"Type": "Processing",
"Arguments": {
"ProcessingResources": {
"ClusterConfig": {
"InstanceType": "ml.m5.large",
"InstanceCount": 1,
"VolumeSizeInGB": 30
}
},
"AppSpecification": {
"ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
"ContainerEntrypoint": [
"python3",
"/opt/ml/processing/input/code/preprocessor.py"
]
},
"RoleArn": "arn:aws:iam::123456789012:role/foo",
"ProcessingInputs": [
{
"InputName": "code",
"AppManaged": false,
"S3Input": {
"S3Uri": "s3://bucket/preprocessor.py",
"LocalPath": "/opt/ml/processing/input/code",
"S3DataType": "S3Prefix",
"S3InputMode": "File",
"S3DataDistributionType": "FullyReplicated",
"S3CompressionType": "None"
}
}
]
}
}
]
}
You could use the above JSON with CloudFormation/CDK, but you build the JSON with the SageMaker SDK.
You can also define model building workflows using Step Functions state machines, using the Data Science SDK, or Airflow.
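Since the pipeline definition is plain JSON, it can be inspected with nothing but the standard library. The sketch below lists the step names and types from a definition string; the helper summarize_steps is hypothetical, and the sample is a trimmed version of the JSON shown above.

```python
import json


def summarize_steps(definition_json):
    """Return (name, type) pairs for each step in a pipeline definition string."""
    definition = json.loads(definition_json)
    return [(step["Name"], step["Type"]) for step in definition.get("Steps", [])]


# Trimmed-down copy of the definition shown above.
sample_definition = json.dumps({
    "Version": "2020-12-01",
    "Metadata": {},
    "Parameters": [],
    "Steps": [
        {"Name": "processing", "Type": "Processing", "Arguments": {}},
    ],
})

print(summarize_steps(sample_definition))
```

With the SDK installed, the string returned by pipeline.definition() could be passed to the same function.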

AWS .NET Core 3.1 Mock Lambda Test Tool - How to set up 2 or more functions for local testing

So I have a very simple aws-lambda-tools-defaults.json in my project:
{
"profile": "default",
"region": "us-east-2",
"configuration": "Release",
"framework": "netcoreapp3.1",
"function-runtime": "dotnetcore3.1",
"function-memory-size": 256,
"function-timeout": 30,
"function-handler": "LaCarte.RestaurantAdmin.EventHandlers::LaCarte.RestaurantAdmin.EventHandlers.Function::FunctionHandler"
}
It works, I can test my lambda code locally which is great. But I want to be able to test multiple lambdas, not just one. Does anyone else know how to change the JSON so that I can run multiple lambdas in the mock tool?
Thanks in advance,

Simply remove the function-handler attribute from your aws-lambda-tools-defaults.json file and add a template attribute referencing your serverless.template (the AWS CloudFormation template used to deploy your Lambda functions to your AWS cloud environment).
{
...
"template": "serverless.template"
...
}
Then you can test your Lambda functions locally, for example with the AWS .NET Mock Lambda Test Tool. You'll see that the Function dropdown list has changed from listing the Lambda function name you specified in your function-handler to the list of Lambda functions declared in your serverless.template file, and you can test them all locally! :)
You can find more info in this discussion
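The edit described above is small enough to do by hand, but as a sketch it can also be expressed in Python: drop function-handler and add template. The helper name use_template is made up for illustration, and the sample dict is a shortened copy of the question's file.

```python
import json


def use_template(defaults, template_path="serverless.template"):
    """Return a copy of the defaults without function-handler, pointing at a template."""
    updated = {key: value for key, value in defaults.items() if key != "function-handler"}
    updated["template"] = template_path
    return updated


# Shortened copy of the aws-lambda-tools-defaults.json from the question.
defaults = {
    "profile": "default",
    "region": "us-east-2",
    "framework": "netcoreapp3.1",
    "function-handler": "LaCarte.RestaurantAdmin.EventHandlers::LaCarte.RestaurantAdmin.EventHandlers.Function::FunctionHandler",
}

print(json.dumps(use_template(defaults), indent=2))
```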

Answering after a long time, but this might help someone else. To deploy and test multiple Lambdas from Visual Studio, you have to implement a serverless.template. Check the AWS SAM documentation.
You can start with this one - https://docs.aws.amazon.com/toolkit-for-visual-studio/latest/user-guide/lambda-build-test-severless-app.html

Google Dataprep copy flows from one project to another

I have two Google projects: dev and prod. I also import data from different storage buckets located in these projects: dev-bucket and prod-bucket.
After I have made and tested changes in the dev environment, how can I smoothly apply (deploy/copy) the changes to prod as well?
What I do now is export the flow from dev and then re-import it into prod. However, each time I need to manually do the following in the prod flows:
Change the datasets that serve as inputs in the flow
Replace the manual and scheduled destinations with the right BigQuery dataset (dev-dataset-bigquery and prod-dataset-bigquery)
How can this be done more smoothly?

If you want to copy data between Google Cloud Storage (GCS) buckets dev-bucket and prod-bucket, Google provides a Storage Transfer Service with this functionality. https://cloud.google.com/storage-transfer/docs/create-manage-transfer-console You can either manually trigger data to be copied from one bucket to another or have it run on a schedule.
For the second part, it sounds like both dev-dataset-bigquery and prod-dataset-bigquery are loaded from files in GCS? If this is the case, the BigQuery Transfer Service may be of use. https://cloud.google.com/bigquery/docs/cloud-storage-transfer You can trigger a transfer job manually, or have it run on a schedule.
As others have said in the comments, if you need to verify data before initiating transfers from dev to prod, a CI system such as Spinnaker may help. If the verification can be automated, a system such as Apache Airflow (running on Cloud Composer, if you want a hosted version) provides more flexibility than the transfer services.

Follow the procedure below to move a plan from one environment to another using the API, and to update the dataset and the output for the new environment.
1) Export a plan
GET
https://api.clouddataprep.com/v4/plans/<plan_id>/package
2) Import the plan
POST
https://api.clouddataprep.com/v4/plans/package
3) Update the input dataset
PUT
https://api.clouddataprep.com/v4/importedDatasets/<dataset_id>
{
"name": "<new_dataset_name>",
"bucket": "<bucket_name>",
"path": "<bucket_file_name>"
}
4) Update the output
PATCH
https://api.clouddataprep.com/v4/outputObjects/<output_id>
{
"publications": [
{
"path": [
"<project_name>",
"<dataset_name>"
],
"tableName": "<table_name>",
"targetType": "bigquery",
"action": "create"
}
]
}
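The four calls above can be laid out as data before wiring them to an HTTP client. This Python sketch only builds the method, URL, and body for each step; the function name promotion_requests and the sample IDs are placeholders, and authentication and the upload of the exported package are omitted because the answer does not describe them.

```python
BASE_URL = "https://api.clouddataprep.com/v4"


def promotion_requests(plan_id, dataset_id, output_id, dataset, publication):
    """Return the four steps above as (method, url, json_body) tuples."""
    return [
        # 1) Export the plan from the source environment.
        ("GET", f"{BASE_URL}/plans/{plan_id}/package", None),
        # 2) Import the plan into the target environment (package upload not shown).
        ("POST", f"{BASE_URL}/plans/package", None),
        # 3) Point the imported dataset at the target bucket and file.
        ("PUT", f"{BASE_URL}/importedDatasets/{dataset_id}", dataset),
        # 4) Point the output at the target BigQuery dataset and table.
        ("PATCH", f"{BASE_URL}/outputObjects/{output_id}", {"publications": [publication]}),
    ]


# Placeholder IDs and names, for illustration only.
steps = promotion_requests(
    plan_id="12",
    dataset_id="34",
    output_id="56",
    dataset={"name": "prod-input", "bucket": "prod-bucket", "path": "input.csv"},
    publication={
        "path": ["prod-project", "prod-dataset-bigquery"],
        "tableName": "my_table",
        "targetType": "bigquery",
        "action": "create",
    },
)
for method, url, body in steps:
    print(method, url, body)
```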

Execute stack after cloudFormation Deploy

I have an API Gateway built with the AWS Serverless Application Model and integrated with GitHub via CodePipeline. Everything is running fine: the pipeline receives the webhook, runs the build defined in buildspec.yml, and deploys the CloudFormation file, creating or updating the stack.
The thing is, after the stack is updated it still needs an approval in the console. How can I make the stack update execute automatically?

It sounds like your pipeline is doing one of two things, unless I'm misunderstanding you:
Making a change set but not executing it in the CloudFormation console.
Proceeding to a manual approval step in the pipeline and awaiting your confirmation.
Since #2 is simply solved by removing that step, let's talk about #1.
Assuming you are successfully creating a change set called ChangeSetName, you need a step in your pipeline with the following (CloudFormation JSON template syntax):
"Name": "StepName",
"ActionTypeId": {"Category": "Deploy",
"Owner": "AWS",
"Provider": "CloudFormation",
"Version": "1"
},
"Configuration": {
"ActionMode": "CHANGE_SET_EXECUTE",
"ChangeSetName": {
"Ref": "ChangeSetName"
},
...
Keep the other parameters (e.g. RoleArn) consistent per usual.
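If the pipeline template is generated from code, the action above could be assembled roughly as follows. This is only a sketch: the helper change_set_execute_action and the sample values are hypothetical, and the remaining configuration (whatever the existing deploy action already uses) is passed through unchanged rather than guessed.

```python
import json


def change_set_execute_action(step_name, change_set_name, other_configuration=None):
    """Build a pipeline action that executes an existing change set."""
    # Start from the caller's existing settings, then force the two keys that matter here.
    configuration = dict(other_configuration or {})
    configuration["ActionMode"] = "CHANGE_SET_EXECUTE"
    configuration["ChangeSetName"] = change_set_name
    return {
        "Name": step_name,
        "ActionTypeId": {
            "Category": "Deploy",
            "Owner": "AWS",
            "Provider": "CloudFormation",
            "Version": "1",
        },
        "Configuration": configuration,
    }


# Placeholder values, for illustration only.
action = change_set_execute_action(
    "ExecuteChangeSet",
    "my-change-set",
    other_configuration={"StackName": "my-stack"},
)
print(json.dumps(action, indent=2))
```

The change set name must be the same one used by the earlier action that creates the change set.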