Cloud Run 503 error due to high cpu usage - google-cloud-platform

I just implemented cloud run to process/encode video for my mobile application. I have recently gotten an unknown 503 error: POST 503 Google-Cloud-Tasks: The request failed because the HTTP connection to the instance had an error.
My process starts when a user uploads a video to cloud storage, then a function is triggered and sends the video source path to cloud tasks to be enqueued for encoding. Finally cloud run downloads the video, processes it via ffmpeg, and uploads everything to a separate bucket (all downloaded temp files are deleted).
I know video encoding is a CPU-heavy task, but my application only allows videos of up to ~3 minutes to be encoded (usually around 100 MB). It works perfectly fine for shorter videos, but the longer ones trigger the 503 error after processing for 2+ minutes.
My instances are only used for video encoding and only allow 1 concurrent request/instance. Here are my services settings:
CPU - 2 cpu
Memory - 2 GB
Concurrency - 1
Request Timeout - 900 seconds (15 minutes)
The documentation states that it is because of heavy cpu tasks so it's clear it is caused by the processing of heavier files, but I'm unsure what I can do to fix this given the max settings. Is it possible to set a cap on the CPU so it doesn't go overboard? Or is cloud run not a good solution for this kind of task?

How to do "live request batching" in gcloud

Here is my situation:
I have a rather slow tensorflow model that runs on GPU (2 to 3 seconds per prediction)
A prediction for a single 'entity' vs a prediction for 8 'entities' takes about the same time
This means I could be 8 times as efficient by simply combining multiple predictions in the same request
I have a service on AI platform serving requests to that model
The service works for slow request rates but has trouble scaling up (anything over 4 QPS is too much to handle)
My question then is:
Is there a standard way / best practice for batching live client requests:
When receiving a request, wait a little bit for other requests
After a while, or when the number of requests reaches a set number, forward the requests in a single "batch" to another service.
If traffic is low, the delay will expire before the batch is full, but since traffic is low, that's not an issue
If traffic is high, the batch will be full before the delay, and the client will have to wait less
I have an almost-working solution with App Engine + Firebase (for hosting the shared 'queue'), but implementing the delay is giving me trouble (App Engine doesn't seem to like Python's threading.Timer).
I'd appreciate something that could work with app engine, but at this point I'm open to any suggestions (as long as it is applicable on google cloud).
Thanks!
The ideal (but not the cheapest) solution is to use Dataflow.
When a prediction request comes in, publish it in PubSub
Deploy a Dataflow pipeline in streaming mode, with fixed windows of X minutes and an additional, non-accumulating trigger that fires after Y events in the window.
When a window trigger is performed (either on the number of messages or on the timer) do the batch processing
You can also imagine other designs that are simpler and cheaper. In both of them, still publish the prediction requests to Pub/Sub.
1. Schedule a Cloud Function or a Cloud Run service every X minutes to pull the Pub/Sub subscription and then trigger the batch job. The drawback is that the interval is fixed.
2. When you publish a message to Pub/Sub, also store (in Firestore, for example) a counter that you increment, along with the date of the first message published to Pub/Sub.
If the number of messages is above your threshold, send a request to your other process, which pulls the Pub/Sub subscription and runs the batch processing (as in #1). Reset the counter value and the first-message date.
Set up a Cloud Scheduler job that checks, every minute, the first-message date in Firestore. If it is older than your time limit, send a request to your other process, which pulls the Pub/Sub subscription and runs the batch processing (as in #1). Reset the counter value and the first-message date.
Design #2 will generate a lot of Firestore reads and writes, but it will be cheaper than Dataflow.
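The size-or-timeout rule that the question describes, and that the designs above implement with managed services, can be sketched in plain Python. This is only an in-memory illustration, not a Dataflow or Pub/Sub client: the class name `SizeOrTimeBatcher`, its parameters, and the explicit `now` argument are assumptions made for this example, and a real deployment would keep this state in a shared store such as the Firestore counter mentioned above.

```python
from typing import Any, List, Optional


class SizeOrTimeBatcher:
    """Hypothetical helper: releases a batch when it is full or too old."""

    def __init__(self, max_size: int, max_delay_s: float) -> None:
        self.max_size = max_size
        self.max_delay_s = max_delay_s
        self._items: List[Any] = []
        # Arrival time of the oldest pending item, or None when empty.
        self._first_ts: Optional[float] = None

    def add(self, item: Any, now: float) -> Optional[List[Any]]:
        """Queue an item; return the batch if the size threshold is reached."""
        if not self._items:
            self._first_ts = now
        self._items.append(item)
        if len(self._items) >= self.max_size:
            return self._flush()
        return None

    def poll(self, now: float) -> Optional[List[Any]]:
        """Periodic check; return a partial batch once the delay has expired."""
        if self._items and now - self._first_ts >= self.max_delay_s:
            return self._flush()
        return None

    def _flush(self) -> List[Any]:
        batch, self._items, self._first_ts = self._items, [], None
        return batch


if __name__ == "__main__":
    batcher = SizeOrTimeBatcher(max_size=8, max_delay_s=0.5)
    batcher.add("request-1", now=0.0)
    batcher.add("request-2", now=0.1)
    # Low traffic: the delay expires before the batch is full.
    print(batcher.poll(now=0.6))
```

Passing the clock in explicitly keeps the rule easy to test. Here `add` plays the role of the counter threshold and `poll` the role of the periodic scheduler check.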

I have tested my AWS server (8 GB RAM) on which my Moodle site is hosted for 1000 users using JMeter, I am getting 0% error, what could be the issue?

My Moodle site is hosted on an AWS server with 8 GB of RAM. I carried out various tests on the server using JMeter (NFT), going from 15 to almost 1000 users, but I am still not getting any errors (less than 0.3%). I am using the scripts provided by Moodle itself. What could be the issue? Is there any issue with the script? I have attached a screenshot showing the report of the 1000-user test for reference.
If you're happy with the amount of errors and response times (maximum response time is more than 1 hour which is kind of too much for me) you can stop here and report the results.
However I doubt that a real user will be happy to wait 1 hour to see the login page so I would rather define some realistic pass/fail criteria, for example would expect the response time to be not more than 5 seconds. In this case you will have > 60% of failures if this is what you're trying to achieve.
You can consider using the following test elements
Set reasonable response timeouts using HTTP Request Defaults,
so that any request lasting longer than 5 seconds is terminated and marked as failed
Or use a Duration Assertion;
in this case JMeter will wait for the response and mark it as failed if the response time exceeds the defined duration
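To see how an existing result file would fare against such a pass/fail criterion before re-running the test, the same check can be applied offline. The sketch below assumes the results were saved as a CSV `.jtl` file with a header row containing JMeter's `elapsed` (milliseconds) and `success` columns; the function name `failure_rate` and the inline sample data are made up for illustration.

```python
import csv
import io
from typing import TextIO


def failure_rate(jtl: TextIO, max_elapsed_ms: int = 5000) -> float:
    """Fraction of samples that errored or took longer than max_elapsed_ms."""
    total = 0
    failed = 0
    for row in csv.DictReader(jtl):
        total += 1
        too_slow = int(row["elapsed"]) > max_elapsed_ms
        errored = row["success"].strip().lower() != "true"
        if too_slow or errored:
            failed += 1
    return failed / total if total else 0.0


# Illustrative data only, not taken from the test in the question.
SAMPLE_JTL = """timeStamp,elapsed,label,responseCode,success
1000,1200,Login,200,true
1500,7300,Login,200,true
2100,400,Course,500,false
2600,4999,Course,200,true
"""

if __name__ == "__main__":
    rate = failure_rate(io.StringIO(SAMPLE_JTL), max_elapsed_ms=5000)
    print(f"{rate:.0%} of samples fail the 5 second criterion")
```

For a real run, the `io.StringIO` object would be replaced by `open("results.jtl", newline="")`, where the file name is again a placeholder.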

Simple HelloWorld app on cloudrun (or knative) seems too slow

I deployed a sample HelloWorld app on Google Cloud Run, which is basically k-native, and every call to the API takes 1.4 seconds at best, in an end-to-end manner. Is it supposed to be so?
The sample app is at https://cloud.google.com/run/docs/quickstarts/build-and-deploy
I deployed the very same app on my localhost as a docker container and it takes about 22ms, end-to-end.
The same app on my GKE cluster takes about 150 ms, end-to-end.
import os
from flask import Flask
app = Flask(__name__)
# Register the handler for the root URL.
@app.route('/')
def hello_world():
    target = os.environ.get('TARGET', 'World')
    return 'Hello {}!\n'.format(target)
if __name__ == "__main__":
    # Cloud Run passes the port to listen on in the PORT environment variable.
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
I have a little experience with FaaS, and I expected API calls to get faster as I invoked them in a row (as in cold start vs. warm start).
But no matter how many times I execute the command it doesn't go below 1.4 seconds.
I don't think network distance is the dominant factor here. The round-trip time via ping to the API endpoint is only about 50 ms.
So my questions are as follows:
Is it potentially an unintended bug? Is it a technical difficulty which will be resolved eventually? Or maybe nothing's wrong, it's just the SLA of k-native?
If nothing's wrong with Google Cloud Run and/or k-native, what is the dominant time-consuming factor here for my API call? I'd love to learn the mechanism.
Additional Details:
Where I am located at: Seoul/Asia
The region for my Cloud Run app: us-central1
type of Internet connection I am testing under: Business, Wired
app's container image size: 343.3MB
the bucket location that Container Registry is using: gcr.io
WebPageTest from Seoul/Asia (warmup time):
Content Type: text/html
Request Start: 0.44 s
DNS Lookup: 249 ms
Initial Connection: 59 ms
SSL Negotiation: 106 ms
Time to First Byte: 961 ms
Content Download: 2 ms
WebPageTest from Chicago/US (warmup time):
Content Type: text/html
Request Start: 0.171 s
DNS Lookup: 41 ms
Initial Connection: 29 ms
SSL Negotiation: 57 ms
Time to First Byte: 61 ms
Content Download: 3 ms
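Adding up the phases of the two WebPageTest runs above makes it easier to see which one dominates. The sketch below assumes the listed phases are sequential and non-overlapping, and leaves out "Request Start" on the assumption that it is an offset rather than a phase; the dictionary keys and the `summarize` helper are names chosen for this example.

```python
# Phase timings in milliseconds, copied from the two WebPageTest runs above.
SEOUL = {"dns": 249, "connect": 59, "tls": 106, "ttfb": 961, "download": 2}
CHICAGO = {"dns": 41, "connect": 29, "tls": 57, "ttfb": 61, "download": 3}


def summarize(phases):
    """Return (total_ms, slowest_phase_name, slowest_share_of_total)."""
    total = sum(phases.values())
    slowest = max(phases, key=phases.get)
    return total, slowest, phases[slowest] / total


if __name__ == "__main__":
    for name, phases in (("Seoul", SEOUL), ("Chicago", CHICAGO)):
        total, slowest, share = summarize(phases)
        print(f"{name}: {total} ms total, slowest phase {slowest} ({share:.0%})")
```

Under those assumptions the Seoul phases add up to 1377 ms, close to the 1.4 seconds observed, with time to first byte accounting for roughly 70% of it, against 191 ms in total from Chicago.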
ANSWER by Steren, the Cloud Run product manager
We have detected high latency when calling Cloud Run services from
some particular regions in the world. Sadly, Seoul seems to be one of
them.
[Update: This person has a networking problem in his area. I tested his endpoint from Seattle with no problems. Details in the comments below.]
I have worked with Cloud Run constantly for the past several months. I have deployed several production applications and dozens of test services. I am in Seattle, Cloud Run is in us-central1. I have never noticed a delay. Actually, I am impressed with how fast a container starts up.
For one of my services, I am seeing cold start time to first byte of 485ms. Next invocation 266ms, 360ms. My container is checking SSL certificates (2) on the Internet. The response time is very good.
For another service which is a PHP website, time to first byte on cold start is 312ms, then 94ms, 112ms.
What could be factors that are different for you?
How large is your container image? Check Container Registry for the size. My containers are under 100 MB. The larger the container the longer the cold start time.
Where is the bucket located that Container Registry is using? You want the bucket to be in us-central1 or at least in the US. This will change soon, when new Cloud Run regions are announced.
What type of Internet connection are you testing under? Home based or Business. Wireless or Ethernet connection? Where in the world are you testing from? Launch a temporary Compute Engine instance, repeat your tests to Cloud Run and compare. This will remove your ISP from the equation.
Increase the memory allocated to the container. Does this affect performance? Python/Flask does not require much memory; my containers typically use 128 MB or 256 MB. Container images are loaded into memory, so if you have a bloated container, you might not have enough memory left, which reduces performance.
What do the Stackdriver logs show you? You can see container starts, requests, and container terminations.
(Cloud Run product manager here)
We have detected high latency when calling Cloud Run services from some particular regions in the world. Sadly, Seoul seems to be one of them.
We will explicitly capture this as a known issue, and we are working on fixing it before General Availability. Feel free to open a new issue in our public issue tracker.

Echo Spot sometimes takes minutes to start playing a video

I'm currently developing a custom skill for the Echo Spot. I'm using AWS Lambda functions in .NET Core with the Alexa.NET SDK. One of the intents lets Alexa play videos, which are hosted in an S3 bucket. Sometimes (randomly: once right after opening the skill, once after the 4th or 5th video), Alexa immediately understands the command but takes ages to play the video. According to the CloudWatch logs, the command is parsed and the Lambda function executed within a few hundred milliseconds, but the video starts playing after a long delay (up to two minutes).
REPORT RequestId: xyz Duration: 366.44 ms Billed Duration: 400 ms Memory Size: 576 MB Max Memory Used: 79 MB
The videos being returned by the Lambda function are rather short (5-15 seconds), in case that affects the issue. The Wi-Fi itself is stable, with more than 30 Mbit/s available, and the Echo Spot is not too far away from the Wi-Fi router.
We've tried different video encodings (MP4, H264, ...), different audio codecs, samplerates and framerates - the issue remains. Any clues what could cause this issue? We've read the recommendations for videos and applied all the recommended settings to the video.
Can I somehow access the device's logs to see if there's another issue with the video?
It turns out that videos are streamed when they are combined with a plain-text output speech. If your output speech is empty, the Echo Spot downloads the whole video and starts playing only once it is completely loaded. Hence, I recommend adding a speech reply to all of your videos to ensure that they load smoothly.
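As an illustration of that recommendation, here is a rough sketch of a skill response that pairs a plain-text speech with the video directive. It is written as a plain Python dictionary rather than with Alexa.NET, the function name `build_video_response` and the example URL are hypothetical, and the JSON field names follow the `VideoApp.Launch` directive as I understand it, so they should be checked against the Alexa documentation before use.

```python
import json


def build_video_response(video_url, title, speech_text):
    """Build a response that speaks first and then launches the video."""
    if not speech_text:
        # The point of the answer above: never send the video without speech.
        raise ValueError("speech_text must not be empty")
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "directives": [
                {
                    "type": "VideoApp.Launch",
                    "videoItem": {
                        "source": video_url,
                        "metadata": {"title": title},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    response = build_video_response(
        "https://example.com/clip.mp4",  # placeholder URL
        "Example clip",
        "Here is your video.",
    )
    print(json.dumps(response, indent=2))
```

Raising on an empty speech text makes the slow, download-first path impossible to hit by accident.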

Counting number of requests per second generated by JMeter client

This is how application setup goes -
2 c4.8xlarge instances
10 m4.4xlarge jmeter clients generating load. Each client used 70 threads
While conducting a load test on a simple GET request (a 685-byte page), I came across an issue of reduced throughput after the test had been running for some time. A throughput of about 18000 requests/sec is reached with 700 threads, remains at this level for 40 minutes, and then drops. The thread count remains at 700 throughout the test. I have executed tests with different load patterns, but the results have been the same.
The application response time stays considerably low throughout the test -
According to ELB monitor, there is reduction in number of requests (and I suppose hence the lower throughput ) -
There are no errors encountered during test run. I also set connect timeout with http request but yet no errors.
I discussed this issue with aws support at length and according to them I am not blocked by any network limit during test execution.
Given that the number of threads remains constant during the test run, what are these threads doing? Is there a metric I can check to find out the number of requests generated (not Hits/sec) by a JMeter client instance?
Testplan - http://justpaste.it/qyb0
Try adding the following Test Elements:
HTTP Cache Manager
and especially DNS Cache Manager as it might be the situation where all your threads are hitting only one c4.8xlarge instance while the remaining one is idle. See The DNS Cache Manager: The Right Way To Test Load Balanced Apps article for explanation and details.
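On the question of how many requests each JMeter client actually generated, one option (not part of the suggestion above) is to count samples per second in that client's own result file. The sketch below assumes a CSV `.jtl` file with a header row and JMeter's `timeStamp` column in epoch milliseconds; the function name `samples_per_second` and the inline sample data are invented for illustration.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO


def samples_per_second(jtl: TextIO) -> Dict[int, int]:
    """Map each epoch second to the number of samples recorded in it."""
    counts: Counter = Counter()
    for row in csv.DictReader(jtl):
        # timeStamp is in milliseconds; integer division buckets by second.
        counts[int(row["timeStamp"]) // 1000] += 1
    return dict(counts)


# Illustrative data only, not taken from the test in the question.
SAMPLE_JTL = """timeStamp,elapsed,label,responseCode,success
1000,12,GET page,200,true
1500,11,GET page,200,true
2100,14,GET page,200,true
2600,13,GET page,200,true
2900,12,GET page,200,true
"""

if __name__ == "__main__":
    for second, count in sorted(samples_per_second(io.StringIO(SAMPLE_JTL)).items()):
        print(second, count)
```

Running this against each client's file and comparing the per-second counts before and after the drop would show whether the clients are sending fewer requests or the requests are being lost further along.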