Severities of all logs on AI Platform are errors - google-cloud-ml

On Google AI Platform, all logs printed on stderr are interpreted as ERROR.
Is there any way to print logs as INFO, WARNING, and CRITICAL?

Take a look at the AI Platform troubleshooting documentation for training logs. It states that you can put logging events in your application with standard Python libraries, such as logging.
I haven't tried it, but it seems you can use the Logger class to emit messages at the desired level:
Logger.info(msg, *args, **kwargs)
Logs a message with level INFO on this logger. The arguments are interpreted as for debug().
Logger.warning(msg, *args, **kwargs)
Logs a message with level WARNING on this logger. The arguments are interpreted as for debug().
Logger.critical(msg, *args, **kwargs)
Logs a message with level CRITICAL on this logger. The arguments are interpreted as for debug().
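For illustration, here is a minimal sketch (untested on AI Platform) of emitting records at those levels with the standard logging module; whether each record surfaces with the matching severity depends on how the platform ingests it:

import logging

# Illustrative handler setup only; AI Platform may route records differently.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("preprocessing finished")       # intended severity: INFO
logger.warning("learning rate looks high")  # intended severity: WARNING
logger.critical("checkpoint write failed")  # intended severity: CRITICAL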

Related

Unable to limit number of Cloud Run instances spun up by Pub/Sub

Situation:
I'm trying to have a single message in Pub/Sub processed by exactly 1 instance of Cloud Run. Additional messages will be processed by another instance of Cloud Run. Each message triggers a heavy computation that runs for around 100s in the Cloud Run instance.
Currently, Cloud Run is configured with max concurrency requests = 1, and min/max instances of 0/5. Subscription is set to allow for 600s Ack deadline.
Issue:
Each message seems to be triggering multiple instances of Cloud Run to be spun up. I believe this is due to high CPU utilization, which causes Cloud Run to spin up additional instances to help process the load. Unfortunately, these new instances attempt to process the same exact message, causing unintended results.
Question:
Is there a way to force Cloud Run to only have 1 instance process a single message, regardless of CPU utilization and other potential factors?
Relevant Code Snippet:
import base64
import json

from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/")
async def handleMessage(request: Request):
    envelope = await request.json()

    # Basic data validation
    if not envelope:
        msg = "no Pub/Sub message received"
        print(f"error: {msg}")
        return Response(content=msg, status_code=400)
    if not isinstance(envelope, dict) or "message" not in envelope:
        msg = "invalid Pub/Sub message format"
        print(f"error: {msg}")
        return Response(content=msg, status_code=400)

    message = envelope["message"]
    if isinstance(message, dict) and "data" in message:
        data = json.loads(base64.b64decode(message["data"]).decode("utf-8").strip())

    try:
        # Do computationally heavy operations here
        # Will run for about 100s
        return Response(status_code=204)
    except Exception as e:
        print(e)
Thanks!
I've found the issue.
Apparently, Pub/Sub guarantees "at least once" delivery, which means it is possible for it to deliver a message to a subscriber more than once. The onus is therefore on the subscriber, which in my case is Cloud Run, to handle such scenarios (idempotency) gracefully.
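For example, a minimal sketch of such an idempotency check, assuming a hypothetical Firestore collection named processed_messages that all instances share (an in-process set would not survive across instances), keyed by the messageId field of the push envelope:

from google.cloud import firestore

db = firestore.Client()

def already_processed(message_id: str) -> bool:
    # Redeliveries of the same Pub/Sub message reuse the same messageId.
    return db.collection("processed_messages").document(message_id).get().exists

def mark_processed(message_id: str) -> None:
    db.collection("processed_messages").document(message_id).set({"done": True})

In handleMessage, the check would run against envelope["message"]["messageId"] before starting the ~100s computation, with mark_processed called once it succeeds; a Firestore transaction would be needed to make the check-then-set fully race-free.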

Attach additional info to Lambda time-out message?

When a Lambda times out, it outputs a message to CloudWatch (if enabled) saying "Task timed out".
It would be beneficial to attach additional info (such as the context of the offending call) to the message. Right now I'm writing the context to CloudWatch at the start of the invocation - but it would sometimes be preferable if everything was contained within a single message.
Is something like that possible?
Unfortunately there is no almost-timed-out hook. You may, however, be able to inspect the context object passed to the Lambda handler to check the remaining run time, and print the additional info when the invocation gets close to timing out.
In python you could use context.get_remaining_time_in_millis() as per the documentation to get that info.
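A minimal sketch of that approach, with an illustrative 5-second threshold (a real handler would check this periodically during the long-running work):

def handler(event, context):
    # ... long-running work ...
    if context.get_remaining_time_in_millis() < 5000:
        # Close to the deadline: log the invocation context ourselves,
        # since the "Task timed out" message itself cannot be extended.
        print(f"about to time out; event was: {event}")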
There is no timeout hook for Lambda, but one can be implemented with a little bit of code:
import signal

def timeout_handler(_signal, _frame):
    raise Exception('other information')

def handler(event, context):
    # Register the SIGALRM handler, then schedule the alarm to fire
    # one second before the Lambda deadline (signal.alarm takes whole seconds).
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(int(context.get_remaining_time_in_millis() / 1000) - 1)
    # ... rest of the handler ...
We implemented something like this for a lot of custom handlers in CloudFormation.

Google cloud functions missing logs issue

I have a small Python CF connected to a Pub/Sub topic that should send out some emails using the SendGrid API.
The CF can dynamically load & run functions based on a env var (CF_FUNCTION_NAME) provided (monorepo architecture):
# main.py
import logging
import os
from importlib import import_module

def get_function(function_name):
    return getattr(import_module(f"functions.{function_name}"), function_name)

def do_nothing(*args):
    return "no function"

cf_function_name = os.getenv("CF_FUNCTION_NAME", False)
disable_logging = os.getenv("CF_DISABLE_LOGGING", False)

def run(*args):
    if not disable_logging and cf_function_name:
        import google.cloud.logging
        client = google.cloud.logging.Client()
        client.get_default_handler()
        client.setup_logging()
        print("Logging enabled")
    cf = get_function(cf_function_name) if cf_function_name else do_nothing
    return cf(*args)
This works fine, except for some issues related to Stackdriver logging:
The print statement "Logging enabled" should be printed on every invocation, but it only appears once.
Exceptions raised in the dynamically loaded function are missing from the logs; instead the logs just show 'finished with status crash', which is not very useful.
Screenshot of the stackdriver logs of multiple subsequent executions:
stackdriver screenshot
Is there something I'm missing here?
Is my dynamic loading of functions somehow messing with the logging?
Thanks.
I don't see any issue here. When you load your function for the first time, one instance is created and logging is enabled (your "Logging enabled" trace). Then the instance stays up until it is evicted (which is unpredictable!).
If you want to see the trace several times, make two calls at the same time. A Cloud Function instance can handle only one request at a time, so two parallel calls force the creation of another instance and thus a new logging initialisation.
The same goes for the exceptions: if you don't catch and print them, nothing will be logged. Simply catch them!
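For example, a minimal sketch of catching and logging around the dynamic dispatch in run (logging.exception is part of the standard library and records the full traceback):

def run(*args):
    cf = get_function(cf_function_name) if cf_function_name else do_nothing
    try:
        return cf(*args)
    except Exception:
        # Log the traceback explicitly instead of relying on the runtime.
        logging.exception("Function %s raised", cf_function_name)
        raise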
It seems like there is an issue with Cloud Functions and Python for a month now, where errors do not get logged automatically with tracebacks and categorized correctly as "Error": GCP Cloud Functions no longer categorizes errors correctly with tracebacks

Google Cloud Scheduler trigger dataflow template batch job fails with "INVALID ARGUMENT"

I have a Dataflow template that I schedule and trigger using Google Cloud Scheduler. We change the job quite often during development, which involves changes to the arguments as well. Quite often the trigger fails with status 400 and INVALID_ARGUMENT. Since there are multiple arguments, it becomes difficult to figure out which of the passed arguments is invalid.
Is there a better way to figure out which argument is causing the trigger to fail, other than checking them manually?
As explained in the Common error guidance for a Bad request error, you cannot see those arguments in Stackdriver.
If the pipeline is written in Python you can expose the arguments using logging:
# import Python logging module.
import logging

import apache_beam as beam

class ExtractWordsFn(beam.DoFn):
    def process(self, *arg, **kwarg):
        logging.info('Arguments: %s', arg)
        logging.info('Key-value args: %s', kwarg)
        my, arguments = arg
        # REST OF YOUR CODE

Learning the Twisted framework and having trouble with the finger server

I am learning the Twisted framework for a project I am working on by using the Twisted Documentation Finger tutorial (http://twistedmatrix.com/documents/current/core/howto/tutorial/intro.html) and I'm having trouble getting my program to work.
Here's the code for the server. It should return "No such user" when I telnet localhost 12345, but the connection just sits there with nothing happening.
from twisted.internet import protocol, reactor
from twisted.protocols import basic

class FingerProtocol(basic.LineReceiver):
    def lineReceived(self, user):
        self.transport.write("No such user\r\n")
        self.transport.loseConnection()

class FingerFactory(protocol.ServerFactory):
    protocol = FingerProtocol

reactor.listenTCP(12345, FingerFactory())
reactor.run()
I have run the server via python twisted-finger.py and sudo python twisted-finger.py, but neither worked.
Does anyone see why this doesn't return the message it is supposed to?
You have to send a finger request to the server before it responds.
According to the finger rfc:
Send a single "command line", ending with <CRLF>.
The command line:
Systems may differ in their interpretations of this line. However, the basic scheme is straightforward: if the line is null (i.e. just a <CRLF> is sent) then the server should return a "default" report which lists all people using the system at that moment. If on the other hand a user name is specified (e.g. FOO<CRLF>) then the response should concern only that particular user, whether logged in or not.
Try typing a word into telnet and hitting enter.
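As an illustration, a minimal client sketch in Python that does what telnet does when you type a word and press enter, assuming the server above is listening on localhost:12345:

import socket

with socket.create_connection(("localhost", 12345)) as s:
    s.sendall(b"foo\r\n")          # a user name terminated by CRLF completes the line
    print(s.recv(1024).decode())   # expect: No such user

The key point is the trailing \r\n: LineReceiver only fires lineReceived once it sees a full line delimiter.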