I'm working on a project to automate updates to Cloud Scheduler jobs with Python.
I already wrote the logic in Python, but I'm facing one problem: updating a Cloud Scheduler job with Python looks a lot like creating one, in that you have to pass most of the job's properties in the code. That is the problem: I only want to update the retry_config, nothing else. I want to leave the schedule and the target as they are, so I don't have to pass those again every time.
Of course I could get the current schedule and target of the job using another class such as GetJobRequest, so that wouldn't be a problem, but I wish I didn't have to, since I don't want to update those fields.
Help?
from google.cloud import scheduler_v1
from google.protobuf import duration_pb2
client = scheduler_v1.CloudSchedulerClient()
retry_config = scheduler_v1.RetryConfig()
retry_config.retry_count = 4
retry_config.max_doublings = 4
retry_config.min_backoff_duration = duration_pb2.Duration(seconds=5)
retry_config.max_backoff_duration = duration_pb2.Duration(seconds=60)
job = scheduler_v1.Job()
job.name = f"projects/{PROJECT_ID}/locations/{DATAFLOW_REGION}/jobs/test"
job.retry_config = retry_config
job.schedule = "* * * * 1"
method = scheduler_v1.HttpMethod(2)
target = scheduler_v1.HttpTarget()
target.uri = "https://xxxx"
target.http_method = method
job.http_target = target
request = scheduler_v1.UpdateJobRequest(
job=job
)
response = client.update_job(request=request)
print(response)
It is possible to specify the properties that need to be changed using the update_mask parameter.
The final code will be as follows:
from google.cloud import scheduler_v1
from google.protobuf import duration_pb2, field_mask_pb2
client = scheduler_v1.CloudSchedulerClient()
retry_config = scheduler_v1.RetryConfig()
retry_config.retry_count = 4
retry_config.max_doublings = 4
retry_config.min_backoff_duration = duration_pb2.Duration(seconds=5)
retry_config.max_backoff_duration = duration_pb2.Duration(seconds=60)
job = scheduler_v1.Job()
job.name = f"projects/{PROJECT_ID}/locations/{DATAFLOW_REGION}/jobs/test"
job.retry_config = retry_config
update_mask = field_mask_pb2.FieldMask(paths=['retry_config'])
request = scheduler_v1.UpdateJobRequest(
job=job,
update_mask=update_mask
)
response = client.update_job(request=request)
print(response)
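If you want to double-check that nothing else changed, you can read the job back afterwards; a minimal sketch reusing the client and job name from above:
get_request = scheduler_v1.GetJobRequest(
    name=f"projects/{PROJECT_ID}/locations/{DATAFLOW_REGION}/jobs/test"
)
job_after = client.get_job(request=get_request)
print(job_after.schedule)      # unchanged
print(job_after.retry_config)  # reflects the new retry settings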
At the moment I am going through the GCP docs trying to figure out the optimal/fastest way to ingest data from BigQuery into Pub/Sub using Python. What I am doing so far (in a simplified way) is:
import json

from google.cloud import pubsub_v1

MESSAGE_SIZE_IN_BYTES = 500
MAX_BATCH_MESSAGES = 20
MAX_BYTES_BATCH = MESSAGE_SIZE_IN_BYTES * MAX_BATCH_MESSAGES
BATCH_MAX_LATENCY_IN_10MS = 0.01
MAX_FLOW_MESSAGES = 20
MAX_FLOW_BYTES = MESSAGE_SIZE_IN_BYTES * MAX_FLOW_MESSAGES
batch_settings = pubsub_v1.types.BatchSettings(
max_messages=MAX_BATCH_MESSAGES,
max_bytes=MAX_BYTES_BATCH,
max_latency=BATCH_MAX_LATENCY_IN_10MS,
)
publisher_options = pubsub_v1.types.PublisherOptions(
flow_control=pubsub_v1.types.PublishFlowControl(
message_limit=MAX_FLOW_MESSAGES,
byte_limit=MAX_FLOW_BYTES,
limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.BLOCK,
),
)
pubsub_client = pubsub_v1.PublisherClient(credentials=credentials,
                                          batch_settings=batch_settings,
                                          publisher_options=publisher_options)
bigquery_client = ....
bq_query_job = bigquery_client.query(QUERY)
rows = bq_query_job.result()
publish_futures = []
for row in rows:
    callback_obj = PubsubCallback(...)
    # BigQuery Row objects are not directly JSON-serializable, so convert each to a dict first
    json_data = json.dumps(dict(row)).encode("utf-8")
    publish_future = pubsub_client.publish(topic_path, json_data)
    publish_future.add_done_callback(callback_obj.callback)
    publish_futures.append(publish_future)
so one message per row. I have been trying to tweak different params for the Pub/Sub publisher client etc., but I cannot get beyond 20-30 messages (rows) per second. Is there a way to read from BigQuery and publish to Pub/Sub faster (at least 1000 times faster than now)?
We also have a need to get data from BigQuery into Pub/Sub, and we do so using Dataflow. I've just looked at one of the jobs we ran today and we loaded 3.4 million rows in about 5 minutes (so ~11,000 rows per second).
Our Dataflow jobs are written in Java, but you could write them in Python if you wish. Here is the code for the pipeline I described above:
package com.ourcompany.pipelines;
import com.google.api.services.bigquery.model.TableRow;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* The {@code BigQueryEventReplayer} pipeline runs a supplied SQL query
* against BigQuery, and sends the results one-by-one to Pub/Sub.
* The query MUST return a column named 'json'; it is this column
* (and ONLY this column) that will be sent onward. The column must be a String type
* and should be valid JSON.
*/
public class BigQueryEventReplayer {
private static final Logger logger = LoggerFactory.getLogger(BigQueryEventReplayer.class);
/**
* Options for the BigQueryEventReplayer. See descriptions for more info
*/
public interface Options extends PipelineOptions {
#Description("SQL query to be run."
+ "An SQL string literal which will be run 'as is'")
@Required
ValueProvider<String> getBigQuerySql();
void setBigQuerySql(ValueProvider<String> value);
#Description("The name of the topic which data should be published to. "
+ "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
@Required
ValueProvider<String> getOutputTopic();
void setOutputTopic(ValueProvider<String> value);
#Description("The ID of the BigQuery dataset targeted by the event")
#Required
ValueProvider<String> getBigQueryTargetDataset();
void setBigQueryTargetDataset(ValueProvider<String> value);
#Description("The ID of the BigQuery table targeted by the event")
#Required
ValueProvider<String> getBigQueryTargetTable();
void setBigQueryTargetTable(ValueProvider<String> value);
#Description("The SourceSystem attribute of the event")
#Required
ValueProvider<String> getSourceSystem();
void setSourceSystem(ValueProvider<String> value);
}
/**
* Takes the data from the TableRow and prepares it for Pub/Sub, including
* adding attributes to ensure the payload is routed correctly.
*/
public static class MapQueryToPubsub extends DoFn<TableRow, PubsubMessage> {
private final ValueProvider<String> targetDataset;
private final ValueProvider<String> targetTable;
private final ValueProvider<String> sourceSystem;
MapQueryToPubsub(
ValueProvider<String> targetDataset,
ValueProvider<String> targetTable,
ValueProvider<String> sourceSystem) {
this.targetDataset = targetDataset;
this.targetTable = targetTable;
this.sourceSystem = sourceSystem;
}
/**
* Entry point of DoFn for Dataflow.
*/
@ProcessElement
public void processElement(ProcessContext c) {
TableRow row = c.element();
if (!row.containsKey("json")) {
logger.warn("table does not contain column named 'json'");
}
Map<String, String> attributes = new HashMap<>();
attributes.put("sourceSystem", sourceSystem.get());
attributes.put("targetDataset", targetDataset.get());
attributes.put("targetTable", targetTable.get());
String json = (String) row.get("json");
c.output(new PubsubMessage(json.getBytes(), attributes));
}
}
/**
* Run the pipeline. This is the entrypoint for running 'locally'
*/
public static void main(String[] args) {
// Parse the user options passed from the command-line
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
run(options);
}
/**
* Run the pipeline. This is the entrypoint that GCP will use
*/
public static PipelineResult run(Options options) {
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read from BigQuery query",
BigQueryIO.readTableRows().fromQuery(options.getBigQuerySql()).usingStandardSql().withoutValidation()
.withTemplateCompatibility())
.apply("Map data to PubsubMessage",
ParDo.of(
new MapQueryToPubsub(
options.getBigQueryTargetDataset(),
options.getBigQueryTargetTable(),
options.getSourceSystem()
)
)
)
.apply("Write message to PubSub", PubsubIO.writeMessages().to(options.getOutputTopic()));
return pipeline.run();
}
}
This pipeline requires that each row retrieved from BigQuery is a JSON document, something that can easily be achieved using TO_JSON_STRING.
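For illustration, the SQL supplied to the pipeline could look something like the sketch below (the table name is hypothetical); TO_JSON_STRING collapses each row into the single string column named 'json' that the pipeline expects:
# Hypothetical example of the query text you might supply (e.g. as the bigQuerySql option):
BIGQUERY_SQL = """
SELECT TO_JSON_STRING(t) AS json
FROM `my-project.my_dataset.my_table` AS t
"""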
I know this might look rather daunting to some (it kinda does to me, I admit), but it will get you the throughput that you require!
You can ignore this part:
Map<String, String> attributes = new HashMap<>();
attributes.put("sourceSystem", sourceSystem.get());
attributes.put("targetDataset", targetDataset.get());
attributes.put("targetTable", targetTable.get());
those are just some extra attributes we add to the Pub/Sub message purely for our own use.
Use Pub/Sub Batch Messages. This allows your code to batch multiple messages into a single call to the Pub/Sub service.
Example code from Google (link):
from concurrent import futures
from google.cloud import pubsub_v1
# TODO(developer)
# project_id = "your-project-id"
# topic_id = "your-topic-id"
# Configure the batch to publish as soon as there are 10 messages
# or 1 KiB of data, or 1 second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
max_messages=10, # default 100
max_bytes=1024, # default 1 MB
max_latency=1, # default 10 ms
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_id)
publish_futures = []
# Resolve the publish future in a separate thread.
def callback(future: pubsub_v1.publisher.futures.Future) -> None:
message_id = future.result()
print(message_id)
for n in range(1, 10):
data_str = f"Message number {n}"
# Data must be a bytestring
data = data_str.encode("utf-8")
publish_future = publisher.publish(topic_path, data)
# Non-blocking. Allow the publisher client to batch multiple messages.
publish_future.add_done_callback(callback)
publish_futures.append(publish_future)
futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)
print(f"Published messages with batch settings to {topic_path}.")
I am writing a Lambda function that has an array of words that I want to put into a slot type, basically updating it every time. Here is how it goes: initially, the slot type has the values ['car', 'bus']. The next time I run the Lambda function, the values get updated to ['car', 'bus', 'train', 'flight'], which is the result of appending a new array onto the old one.
I want to know how to publish the bot every time the Lambda function gets invoked, so that the next time I hit the Lex bot from the front end it uses the latest slot type in the intent and the newly published bot alias. Yep, also the alias!
I know for a fact that put_slot_type() is working, because the slot type is getting updated in the bot.
Here is the function, which takes the new labels as a parameter.
def lex_extend_slots(new_labels):
print('entering lex model...')
lex = boto3.client('lex-models')
slot_name = 'keysDb'
intent_name = 'searchKeys'
bot_name = 'photosBot'
res = lex.get_slot_type(
name = slot_name,
version = '$LATEST'
)
current_labels = res['enumerationValues']
latest_checksum = res['checksum']
arr = [x['value'] for x in current_labels]
labels = arr + new_labels
print('arry: ', arr)
print('new_labels', new_labels)
print('labels in lex: ', labels)
labels = list(set(labels))
enumerationList = [{'value': label, 'synonyms': []} for label in labels]
print('getting ready to push enum..: ', enumerationList)
res_slot = lex.put_slot_type(
name = slot_name,
description = 'updated slots...',
enumerationValues = enumerationList,
valueSelectionStrategy = 'TOP_RESOLUTION',
)
res_build_intent = lex.create_intent_version(
name = intent_name
)
res_build_bot = lex.create_bot_version(
name = bot_name,
checksum = latest_checksum
)
return current_labels
It looks like you're using Version 1 of the Lex Models API on Boto3.
You can use the put_bot method in the lex-models client to effectively create or update your Lex bot.
The put_bot method expects the full list of intents to be used for building the bot.
It is worth mentioning that you will first need to use put_intent to update your intents to ensure they use the latest version of your updated slotType.
Here's the documentation for put_intent.
The appropriate methods for creating and updating aliases are contained in the same link that I've shared above.
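To make that flow concrete, here is a minimal sketch (assuming the V1 lex-models API and the bot/intent names from the question; the 'photosProd' alias name is hypothetical, and you may need to carry over additional intent/bot fields that your bot actually uses):
import boto3

lex = boto3.client('lex-models')

# 1. Re-save the intent with put_intent so it picks up the updated $LATEST slot type.
intent = lex.get_intent(name='searchKeys', version='$LATEST')
lex.put_intent(
    name='searchKeys',
    checksum=intent['checksum'],                  # required when updating an existing intent
    sampleUtterances=intent['sampleUtterances'],
    slots=intent['slots'],                        # the slots referencing the keysDb slot type
    fulfillmentActivity=intent['fulfillmentActivity'],
    # ...copy across any other fields your intent defines (prompts, code hooks, etc.)
)
intent_version = lex.create_intent_version(name='searchKeys')

# 2. Rebuild the bot with put_bot, passing the full intent list; the build runs
#    asynchronously, so you may need to poll get_bot until the status is READY.
bot = lex.get_bot(name='photosBot', versionOrAlias='$LATEST')
intents = [
    i if i['intentName'] != 'searchKeys'
    else {'intentName': 'searchKeys', 'intentVersion': intent_version['version']}
    for i in bot['intents']
]
built = lex.put_bot(
    name='photosBot',
    checksum=bot['checksum'],
    intents=intents,
    locale=bot['locale'],
    childDirected=bot['childDirected'],
    abortStatement=bot['abortStatement'],
    clarificationPrompt=bot['clarificationPrompt'],
    processBehavior='BUILD',
)

# 3. Publish a numbered bot version and point the alias at it with put_bot_alias.
new_version = lex.create_bot_version(name='photosBot', checksum=built['checksum'])
alias = lex.get_bot_alias(name='photosProd', botName='photosBot')   # hypothetical alias name
lex.put_bot_alias(
    name='photosProd',
    botName='photosBot',
    botVersion=new_version['version'],
    checksum=alias['checksum'],                   # omit checksum if the alias does not exist yet
)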
Ray saves a bunch of checkpoints during a call to agent.train(). How do I know which one is the checkpoint with the best agent to load?
Is there any function like tune-analysis-output.get_best_checkpoint(path, mode="max") to explore different loading possibilities over the checkpoints?
As answered in https://discuss.ray.io/t/ray-restore-checkpoint-in-rllib/3186/2 you can use:
analysis = tune.Analysis(experiment_path) # can also be the result of `tune.run()`
trial_logdir = analysis.get_best_logdir(metric="metric", mode="max") # Can also just specify trial dir directly
checkpoints = analysis.get_trial_checkpoints_paths(trial_logdir) # Returns tuples of (logdir, metric)
best_checkpoint = analysis.get_best_checkpoint(trial_logdir, metric="metric", mode="max")
See https://docs.ray.io/en/master/tune/api_docs/analysis.html#id1
analysis = tune.run(
    "A2C",
    name = model_name,
    config = config,
    ...
    checkpoint_freq = 5,        # save a checkpoint every 5 training iterations
    checkpoint_at_end = True,   # plus one at the end of the trial
    restore = best_checkpoint   # optional: resume a later run from the best checkpoint found below
)

# After training, pick the checkpoint with the highest mean episode reward:
trial_logdir = analysis.get_best_logdir(metric="episode_reward_mean", mode="max")
best_checkpoint = analysis.get_best_checkpoint(trial_logdir, metric="episode_reward_mean", mode="max")
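If you then want to load that checkpoint into an agent outside of tune.run, a minimal sketch (assuming RLlib's A2C trainer, matching the "A2C" string above, and the older Ray 1.x API used in this answer):
from ray.rllib.agents.a2c import A2CTrainer

# Rebuild the trainer with the same config used during training, then restore
# the weights and optimizer state from the best checkpoint found above.
agent = A2CTrainer(config=config)
agent.restore(best_checkpoint)  # best_checkpoint is the checkpoint path returned above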
I have this function that uses PrettyTable to gather information about the virtual machines owned by a user. Right now it only shows information, and it works well. I have a new idea: I want to add a button in a new column that allows the user to reboot the virtual machine. I already know how to restart the virtual machines, but what I'm struggling to figure out is the best way to create a dataset that I can iterate through and then turn into an HTML table. I've done similar stuff with PHP/SQL in the past and it was straightforward. I don't think I can iterate through PrettyTable, so I'm wondering what my best option is. PrettyTable does a very good job of making it simple to create the table (as you can see below). I'm hoping to use another method but also keep it very simple: basically, making it relational and easy to iterate through. Any other suggestions are welcome. Thanks!
Here is my current code:
x = PrettyTable()
x.field_names = ["VM Name", "OS", "IP", "Power State"]
for uuid in virtual_machines:
vm = search_index.FindByUuid(None, uuid, True, False)
if vm.summary.guest.ipAddress == None:
ip = "Unavailable"
else:
ip = vm.summary.guest.ipAddress
if vm.summary.runtime.powerState == "poweredOff":
power_state = "OFF"
else:
power_state = "ON"
if vm.summary.guest.guestFullName == None:
os = "Unavailable"
else:
os = vm.summary.guest.guestFullName
x.add_row([vm.summary.config.name, os, ip, power_state])
table = x.get_html_string(attributes = {"class":"table table-striped"})
return table
Here is a sample of what it looks like and also what I plan to do with the button. http://prntscr.com/nki3ci
Figured out how to query the PrettyTable. It was a minor addition, without having to redo it all.
html = '<table class="table"><tr><th>VM Name</th><th>OS</th><th>IP</th><th>Power State</th></tr>'
htmlend = '</table>'  # each row below already closes its own <tr>
body = ''
for vmm in x:
vmm.border = False
vmm.header = False
vm_name = (vmm.get_string(fields=["VM Name"]))
operating_system = (vmm.get_string(fields=["OS"]))
ip_addr = ((vmm.get_string(fields=["IP"])))
body += '<tr><td>'+ vm_name + '</td><td>' + operating_system + '</td> <td>'+ ip_addr +'</td> <td>ON</td></tr>'
html += body
html += htmlend
print(html)
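If you'd rather work with a plain data structure instead of querying the PrettyTable, a list of dicts built from the same pyVmomi fields is easy to iterate and leaves room for the reboot button column. A rough sketch (the button markup is only an illustration):
vms = []
for uuid in virtual_machines:
    vm = search_index.FindByUuid(None, uuid, True, False)
    summary = vm.summary
    vms.append({
        "name": summary.config.name,
        "os": summary.guest.guestFullName or "Unavailable",
        "ip": summary.guest.ipAddress or "Unavailable",
        "power_state": "OFF" if summary.runtime.powerState == "poweredOff" else "ON",
        "uuid": uuid,  # lets a reboot button identify which VM to restart
    })

rows = ""
for v in vms:
    rows += ("<tr><td>" + v["name"] + "</td><td>" + v["os"] + "</td><td>" + v["ip"] + "</td>"
             "<td>" + v["power_state"] + "</td>"
             "<td><button name='reboot' value='" + v["uuid"] + "'>Reboot</button></td></tr>")
html = '<table class="table table-striped">' + rows + '</table>'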
I want to create a Play web service client outside a Play application. For Play WS version 2.4.x it is easy to find that it is done like this:
val config = new NingAsyncHttpClientConfigBuilder().build()
val builder = new AsyncHttpClientConfig.Builder(config)
val client = new NingWSClient(builder.build)
However, in 2.5.x NingWSClient is deprecated; AhcWSClient should be used instead.
Unfortunately, I didn't find a complete example that explains the creation and usage of an AhcWSClient outside of Play. Currently I go with this:
import play.api.libs.ws.ahc.AhcWSClient
import akka.stream.ActorMaterializer
import akka.actor.ActorSystem
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
val ws = AhcWSClient()
val req = ws.url("http://example.com").get().map{
resp => resp.body
}(system.dispatcher)
Is this the correct way of creating an AhcWSClient? And is there a way of creating an AhcWSClient without an ActorSystem?
You are probably using compile-time dependency injection, otherwise you would just use @Inject() (ws: WSClient), right?
There is one example in the docs: https://www.playframework.com/documentation/2.5.x/ScalaWS#using-wsclient
So you could write something like this in your application loader:
lazy val ws = {
import com.typesafe.config.ConfigFactory
import play.api._
import play.api.libs.ws._
import play.api.libs.ws.ahc.{AhcWSClient, AhcWSClientConfig}
import play.api.libs.ws.ahc.AhcConfigBuilder
import org.asynchttpclient.AsyncHttpClientConfig
val configuration = Configuration.reference ++ Configuration(ConfigFactory.parseString(
"""
|ws.followRedirects = true
""".stripMargin))
val parser = new WSConfigParser(configuration, environment)
val config = new AhcWSClientConfig(wsClientConfig = parser.parse())
val builder = new AhcConfigBuilder(config)
val logging = new AsyncHttpClientConfig.AdditionalChannelInitializer() {
override def initChannel(channel: io.netty.channel.Channel): Unit = {
channel.pipeline.addFirst("log", new io.netty.handler.logging.LoggingHandler("debug"))
}
}
val ahcBuilder = builder.configure()
ahcBuilder.setHttpAdditionalChannelInitializer(logging)
val ahcConfig = ahcBuilder.build()
new AhcWSClient(ahcConfig)
}
applicationLifecycle.addStopHook(() => Future.successful(ws.close))
And then inject ws into your controllers. I'm not 100% sure about this approach; I would be happy if some Play guru could validate it.
Regarding an ActorSystem, you need it only to get a thread pool for resolving that Future. You can also just import or inject the default execution context:
play.api.libs.concurrent.Execution.Implicits.defaultContext.
Or you can use your own:
implicit val wsContext: ExecutionContext = actorSystem.dispatchers.lookup("contexts.your-special-ws-config").
AFAIK this is the proper way to create the AhcWSClient, at least in 2.5.0 and 2.5.1, as seen in the Scala API.
You can, of course, always use another HTTP client; there are many available for Scala, such as Newman, Spray client, etc. (although Spray is also based on Akka, so you would have to create an actor system as well).