Google Cloud Scheduler trigger dataflow template batch job fails with "INVALID ARGUMENT" - google-cloud-platform

I have a Dataflow template that I schedule or trigger using Google Cloud Scheduler. We change the job quite often during development, which involves changes to the arguments as well. Quite often we find that the trigger fails with status 400 and INVALID_ARGUMENT. Since there are multiple arguments, it becomes difficult to figure out which of the passed arguments is invalid.
Is there a better way to figure out which argument is causing the trigger to fail, other than checking manually?

Per the Common error guidance for a bad request, you cannot see those arguments in Stackdriver.
If the pipeline is written in Python, you can expose the arguments using logging:
# import Python logging module.
import logging

import apache_beam as beam

class ExtractWordsFn(beam.DoFn):
    def process(self, *arg, **kwarg):
        logging.info('Arguments: %s', arg)
        logging.info('Key-value args: %s', kwarg)
        # unpack the positional arguments you need, e.g.: element, = arg
        # REST OF YOUR CODE
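Along the same lines, another way to see exactly which arguments the launched job received is to log every parsed pipeline option at pipeline start-up. A minimal sketch, assuming a standard run() entry point (adapt the option handling to your own template parameters):

import logging

from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    options = PipelineOptions(argv)
    # get_all_options() returns a dict of every parsed option, so a missing or
    # misspelled argument is easy to spot in the job's Stackdriver logs.
    logging.info('Pipeline options: %s', options.get_all_options())
    # ... build and run your pipeline with these options ...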

Related

How can I track the progress/status of an asynchronous AWS Lambda invocation?

I have an API which I use to trigger AWS Lambda jobs. Upon request, the API invokes an AWS Lambda job with InvocationType='Event'. Afterwards, I want to periodically poll to see whether the AWS Lambda job has finished.
The way that would best fit my architecture is to store an identifier of the Lambda job in a database and periodically check whether the job is finished and what its output is. However, I was not able to find out how I can do this.
How can I periodically poll for the result of an AWS Lambda job, and view the output once it has finished?
I have looked into using InvocationType='RequestResponse', but this requires me to store a future, which I cannot do in a database.
There's no built-in way to check for the status of an asynchronous Lambda invocation.
Asynchronous Lambda invocation, using the event invocation type, is meant to be a fire and forget job. As such, there's no 'progress' or 'status' to get or poll for.
As you don't want to wait for the Lambda to complete, synchronous Lambda invocation is out of the picture. In this case, you need to write your own logic to keep track of the status.
One way you could do this is to store a (job) item in a DynamoDB jobs table with 2 attributes:
jobId UUID (String attribute, set as the partition key)
completed boolean flag (Boolean attribute)
Workflow is then as follows:
Within your API, create & store a new job with completed defaulting to 'false'
Pass the newly-created jobId to the Lambda being invoked in the payload
When the Lambda finishes, lookup the job associated with the passed in jobId within the jobs table & set the completed attribute of the job to true
You can then periodically poll for the result of the job within the DynamoDB table.
Or take a look at using DynamoDB Streams as a way to know when a job finishes in near-real time without polling.
As to viewing the 'output', AWS Lambda just returns a success response without additional information. There is no 'output'. Store any output you might need in persistent storage - maybe an extra output attribute as a String with each job? - & later retrieve it.
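A minimal boto3 sketch of that workflow (the jobs table name, the worker function name, and the payload shape are assumptions for illustration):

import json
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("jobs")         # assumed table name
lambda_client = boto3.client("lambda")

def start_job(payload):
    # 1. create & store a new job with completed defaulting to False
    job_id = str(uuid.uuid4())
    jobs_table.put_item(Item={"jobId": job_id, "completed": False})
    # 2. pass the newly created jobId to the asynchronously invoked Lambda
    lambda_client.invoke(
        FunctionName="my-worker-function",  # assumed function name
        InvocationType="Event",
        Payload=json.dumps({"jobId": job_id, **payload}),
    )
    return job_id

def get_job(job_id):
    # 3. poll this from your API; the worker's last step is an update_item
    #    that sets completed to True (and stores any extra output attribute)
    return jobs_table.get_item(Key={"jobId": job_id}).get("Item")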
@Ermiya Eskandary's answer is absolutely right.
I am a DynamoDB subject matter expert and have implemented this status-tracking pattern (along with error handling, retry, and error logging) for many of my customers.
You could check out the pynamodb_mate library; it has the status tracker pattern implemented, and you can enable it with around 15 lines of code.
In general, when you say you want status tracking, you are talking about the following:
Each task should be handled by only one worker; you want a concurrency lock mechanism to avoid double consumption (many people aren't aware of this; it is about idempotency). See the sketch after this list.
For tasks that succeeded, store additional information such as the output of the task and log the success time.
For tasks that failed, log the error message for debugging, so you can fix the bug and rerun the task.
For tasks that failed, you also want to fetch all of them with one simple query and rerun them with the updated business logic.
For tasks that failed too many times, you don't want to retry them anymore and want to ignore them (many people run into an endless loop when they deploy to production and only then realize that this is a necessary feature).
Run custom queries based on task status for analytics purposes.
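As a rough illustration of the concurrency lock in the first point, a DynamoDB conditional update can atomically move a task from "pending" to "in_progress" so that only one worker wins. The table name, attribute names, and status values below are assumptions, not the pynamodb_mate implementation:

import boto3
from botocore.exceptions import ClientError

tasks_table = boto3.resource("dynamodb").Table("tasks")  # assumed table name

def try_lock(task_id):
    # Return True if this worker acquired the task, False if another worker did.
    try:
        tasks_table.update_item(
            Key={"taskId": task_id},
            UpdateExpression="SET #s = :in_progress",
            ConditionExpression="#s = :pending",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":in_progress": "in_progress",
                ":pending": "pending",
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise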
You can read this Jupyter notebook example.
Basically, with pynamodb_mate your Lambda job application code becomes:
# this is your lambda application code
def lambda_handler(event, context):
    ...

# your new code should be:
with tracker.start_job():
    lambda_handler(event, context)
If your application code is not Python, then you have two options:
Create another Lambda function that invokes the original one in synchronous mode (see the sketch after these options); however, you pay more to run the "caller" Lambda function.
Suppose your Lambda code is in Node.js; then add an additional Lambda runtime as a layer and wrap your Node.js caller in a Python function. In short, you are using Python to call Node.js.
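A rough sketch of the first option: the "caller" Lambda is a thin Python wrapper that runs the tracker and synchronously invokes the original (non-Python) function. The function name and payload shape are assumptions, and tracker stands for the pynamodb_mate status tracker from the snippet above:

import json

import boto3

lambda_client = boto3.client("lambda")

def lambda_handler(event, context):
    # "tracker" is assumed to be the pynamodb_mate status tracker shown earlier
    with tracker.start_job():
        response = lambda_client.invoke(
            FunctionName="my-original-function",  # assumed function name
            InvocationType="RequestResponse",     # synchronous call
            Payload=json.dumps(event),
        )
        return json.loads(response["Payload"].read())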

Get status of scheduler job from python

I have a scheduled job running on Cloud Scheduler, and I would like to get its status ("Success", "Failed") from Python. There is a Python client for Cloud Scheduler here, but I can't find documentation on how to get the status.
You can get the status with the library like this:
from google.cloud.scheduler import CloudSchedulerClient
client = CloudSchedulerClient()
print(client.list_jobs(parent="projects/PROJECT_ID/locations/LOCATION"))
I chose list_jobs, but you can also use get_job.
In the JSON object that you receive, each job has a status field. If it is empty (meaning no error), the latest call succeeded. If not, the call failed and the field contains the gRPC error code.
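For example, a small sketch that flags failed jobs; the parent string is a placeholder, and the status field is the one described above:

from google.cloud.scheduler import CloudSchedulerClient

client = CloudSchedulerClient()
parent = "projects/PROJECT_ID/locations/LOCATION"

for job in client.list_jobs(parent=parent):
    # job.status holds the result of the last attempted execution;
    # an empty status (code 0) means the last run succeeded.
    if job.status.code:
        print(f"{job.name} failed with gRPC code {job.status.code}: {job.status.message}")
    else:
        print(f"{job.name}: last run succeeded")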

Google cloud functions missing logs issue

I have a small Python Cloud Function connected to a Pub/Sub topic that should send out some emails using the SendGrid API.
The CF can dynamically load & run functions based on a provided env var (CF_FUNCTION_NAME) (monorepo architecture):
# main.py
import logging
import os
from importlib import import_module

def get_function(function_name):
    return getattr(import_module(f"functions.{function_name}"), function_name)

def do_nothing(*args):
    return "no function"

cf_function_name = os.getenv("CF_FUNCTION_NAME", False)
disable_logging = os.getenv("CF_DISABLE_LOGGING", False)

def run(*args):
    if not disable_logging and cf_function_name:
        import google.cloud.logging
        client = google.cloud.logging.Client()
        client.get_default_handler()
        client.setup_logging()
        print("Logging enabled")
    cf = get_function(cf_function_name) if cf_function_name else do_nothing
    return cf(*args)
This works fine, except for some issues related to Stackdriver logging:
The print statement "Logging enabled" should be printed on every invocation, but it only appears once?
Exceptions raised in the dynamically loaded function are missing from the logs; instead the logs just show 'finished with status crash', which is not very useful.
Screenshot of the Stackdriver logs of multiple subsequent executions: [stackdriver screenshot]
Is there something I'm missing here?
Is my dynamic loading of functions somehow messing with the logging?
Thanks.
I don't see any issue here. When you load your function for the first time, one instance is created and logging is enabled (your logging trace). Then the instance stays up until its eviction (which is unpredictable!).
If you want to see several traces, perform 2 calls at the same time. A Cloud Functions instance can handle only one request at a time, so 2 calls in parallel imply the creation of another instance and thus a new logging initialisation.
About the exceptions, same thing: if you don't catch and log them, nothing will be logged. Simply catch them!
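For example, wrapping the dynamically loaded function in a try/except so the full traceback always reaches Stackdriver; this is a sketch based on the run() function from the question:

def run(*args):
    # ... logging setup as above ...
    cf = get_function(cf_function_name) if cf_function_name else do_nothing
    try:
        return cf(*args)
    except Exception:
        # logging.exception() logs at ERROR level with the full traceback,
        # so the logs show the real error instead of only 'finished with status crash'
        logging.exception("Unhandled exception in %s", cf_function_name)
        raise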
It seems like there is an issue with Cloud Functions and Python for a month now, where errors do not get logged automatically with tracebacks and categorized correctly as "Error": GCP Cloud Functions no longer categorizes errors correctly with tracebacks

Usage of concurrent.futures.ThreadPoolExecutor throws timeout exception always in aws lambda

I have the following code in an AWS Lambda function to poll an API response until the status is complete. I have used ThreadPoolExecutor from concurrent.futures.
Here is the sample code.
import requests
import json
import concurrent.futures

def copy_url(headers, data):
    collectionStatus = 'INITIATED'
    retries = 0
    print(" The data to be copied is ", data)
    while (collectionStatus != 'COMPLETED' or retries <= 50):
        r = requests.post(
            url=URL,
            headers=headers,
            data=json.dumps(data))
        final_status = r.json().get('status').pop().get('status')
        retries += 1
        print(" The collection status is", final_status)

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future = executor.submit(copy_url, headers, data)
    return_value = future.result()
I had already implemented this using regular threads in Python. However, since I wanted a return value from the thread, I tried implementing it with the executor. Though this works perfectly in PyCharm, it always throws a timeout error in AWS Lambda.
Could someone please explain why this happens only in AWS Lambda?
Note: I have already tried increasing the Lambda timeout value. This happens only when the ThreadPoolExecutor is used; when I comment out that code it works fine. It also works fine with the regular Python thread implementation.
Finally, I changed the implementation to listen to an SQS trigger rather than waiting for the response from an API (the API is handled by a different component and the response takes a significant amount of time).
It looks like we should avoid using parallel processing tasks with Python in AWS Lambda.
From the AWS docs:
The multiprocessing module that comes with Python lets you run
multiple processes in parallel. Due to the Lambda execution
environment not having /dev/shm (shared memory for processes) support,
you can’t use multiprocessing.Queue or multiprocessing.Pool.
If multiprocessing has to be used, only multiprocessing.Pipe is supported.
The question was about multithreaded execution, but the AWS documentation quoted in the answer is about multiprocessing; they are different mechanisms.
Multiprocessing opens a new child process to execute the operation.
Multithreading creates a new thread in the same process to execute the operation.
More information in this answer: Multiprocessing vs Threading Python
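If you do need process-based parallelism inside Lambda, a minimal sketch of the Pipe-based approach that the quoted AWS docs allow (no Queue, no Pool) could look like this; the worker logic is just a placeholder:

from multiprocessing import Pipe, Process

def worker(conn, value):
    # placeholder work; send the result back through the pipe
    conn.send(value * 2)
    conn.close()

def lambda_handler(event, context):
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn, 21))
    p.start()
    result = parent_conn.recv()  # works in Lambda, unlike multiprocessing.Queue
    p.join()
    return {"result": result}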

Schedule task in Django using schedule package

I am trying to learn how to schedule a task in Django using the schedule package. Here is the code I have added to my view. I should mention that I only have one view, so I need to run the scheduler in my index view. I know there is a problem with the code logic: it only runs the scheduler and gets trapped in the loop. Can you tell me how I can use it?
import time
from datetime import datetime

import schedule
from django.http import HttpResponse
from django.template import loader

from .models import objsdb  # assumed location of the objsdb model

def job():
    print("this is scheduled job", str(datetime.now()))

def index(request):
    schedule.every(10).seconds.do(job)
    while True:
        schedule.run_pending()
        time.sleep(1)
    objs = objsdb.objects.all()
    template = loader.get_template('objtest/index.html')
    context = {'objs': objs}
    return HttpResponse(template.render(context, request))
You picked the wrong approach. If you want to schedule something that should run periodically, you should not do it within a web request. The request never ends because of the while loop, and browsers and web servers very much dislike this behavior.
Instead you might want to write a management command that runs on its own and is responsible for calling your tasks.
Additionally you might want to read Django - Set Up A Scheduled Job? - it also covers other approaches such as AMQP and cron. But those would replace your choice of the schedule module.
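A minimal sketch of such a management command (the app name and file path are placeholders); run it in a separate process with python manage.py runscheduler:

# yourapp/management/commands/runscheduler.py
import time
from datetime import datetime

import schedule
from django.core.management.base import BaseCommand

def job():
    print("this is scheduled job", str(datetime.now()))

class Command(BaseCommand):
    help = "Run the schedule loop outside of any web request"

    def handle(self, *args, **options):
        # pass the function itself (job), not the result of calling it
        schedule.every(10).seconds.do(job)
        while True:
            schedule.run_pending()
            time.sleep(1)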