I've deployed a linear model for classification on Google Machine Learning Engine and want to predict new data using online prediction.
When I call the API using the Google API client library, it takes around 0.5 s to get the response for a request with only one instance. I expected the latency to be under 10 microseconds (because the model is quite simple), so 0.5 s seems far too long. I also tried making predictions for the new data offline using the predict_proba method: it took 8.2 s to score more than 100,000 instances, which is much faster per instance than using Google ML Engine. Is there a way I can reduce the latency of online prediction? The model and the server that sends the requests are hosted in the same region.
I want to make predictions in real time (the response is returned immediately after the API gets the request). Is Google ML Engine suitable for this purpose?
Some more info would be helpful:
Can you measure the network latency from the machine you are accessing the service from to GCP? Latency will be lowest if you are calling from a Compute Engine instance in the same region that you deployed the model to.
Can you post your calling code?
Is this the latency to the first request or to every request?
To answer your final question: yes, Cloud ML Engine is designed to support a high rate of queries per second.
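If it helps while you gather that information, here is a minimal sketch of how you could time a single online prediction request with the Python client library; the project name, model name, and instance payload are placeholders, not your actual setup. Note that the very first request also pays one-time costs (building the discovery client, the TLS handshake, and possibly a cold model server), so it is worth timing several requests in a row.

```python
# Rough sketch: time one online prediction request to Cloud ML Engine.
# Project, model, and feature names below are placeholders.
import time
from googleapiclient import discovery

service = discovery.build("ml", "v1")
name = "projects/my-project/models/my_linear_model"  # append "/versions/v1" to pin a version

body = {"instances": [{"feature_1": 0.5, "feature_2": 1.2}]}

start = time.time()
response = service.projects().predict(name=name, body=body).execute()
print("round-trip latency: %.3f s" % (time.time() - start))
print(response.get("predictions"))
```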
I'm using a single-node Bigtable cluster for my sample application running on GKE. Autoscaling has been incorporated into the client code.
Sometimes I experience slowness (>80ms) for the GET calls. In order to investigate it further, I need some clarity around the following Bigtable behaviour.
I have cached the Bigtable table object to ensure faster GET calls. Is the table object persistent on GKE? I have learned that objects are not persistent on Cloud Functions. Should I expect any similar behaviour on GKE?
I'm using service account authentication, but how frequently do auth tokens get refreshed? I have seen frequent refresh logs from the gRPC Java client. I suspect Bigtable won't be able to serve requests during this token refresh period (4-5 seconds).
What if the client machine/instance doesn't scale enough? Will that cause slowness for GET calls?
Bigtable client libraries use connection pooling. How frequently do connections/channels close themselves? I have learned that connections are closed after minutes of inactivity (>15 minutes or so).
I'm planning to read only the needed columns instead of the entire row, by specifying the rowkey as well as a column qualifier filter. Can I expect some performance improvement by not reading the entire row?
According to the official GCP docs, you can find here the causes of slower Bigtable performance; I would suggest going through those docs, as they might be helpful. Also see Troubleshooting performance issues.
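On the last question, about reading only the needed columns: restricting a read with a column qualifier filter reduces the data returned per row, which can help when rows are wide. Here is a minimal sketch with the Python Bigtable client (the Java client exposes an equivalent Filters API); the instance, table, family, and qualifier names are placeholders, not anything from your setup.

```python
# Sketch: read a single row, but only one column, using a row key plus
# family and column qualifier filters. All names are placeholders.
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

only_serial = row_filters.ChainFilter(filters=[
    row_filters.FamilyNameRegexFilter("cf1"),
    row_filters.ColumnQualifierRegexFilter(b"serial"),
])

row = table.read_row(b"device#1234", filter_=only_serial)
if row is not None:
    cell = row.cells["cf1"][b"serial"][0]
    print(cell.value)
```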
I've trained a model using Google's AutoML Video Intelligence and am now trying to make predictions on a video of just 2 seconds using the Node.js client's batch prediction, but the inference time is nowhere near production grade: it's taking almost a minute to make a prediction on just 2 seconds of video. Am I missing some setting here, or is that just the way it is right now?
Some findings on this issue:
Try to follow best practices and see how to improve model performance
I have found another latency issue reported on the Google Groups for AutoML. It suggests that if you're putting the base64-encoded bytes directly in 'inputContent', you might want to consider uploading the input_video file directly to Google Cloud Storage and using 'inputUri' instead of 'inputContent'. This will reduce the request payload size and the upload latency (see the sketch after this list).
This might be caused by a quota limit; you can check the logs (by job ID) for quota errors.
Finally, you can open an issue at the Public Issue Tracker with a sample video and command for issue reproduction and further investigation.
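For the second finding, here is a minimal sketch of the upload step with the google-cloud-storage Python client (bucket and object names are placeholders); the resulting gs:// URI is what you would pass as 'inputUri' instead of base64 bytes in 'inputContent'.

```python
# Sketch: upload the video to Cloud Storage and build the gs:// URI to use
# as 'inputUri'. Bucket and object names are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-videos-bucket")
blob = bucket.blob("uploads/clip-0001.mp4")
blob.upload_from_filename("clip-0001.mp4")

input_uri = "gs://{}/{}".format(bucket.name, blob.name)
print(input_uri)  # reference this URI in the prediction request
```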
Good luck!
I have developed a Django API which accepts images from a live-feed camera in the form of base64 in the request. In the API, this image is converted into a NumPy array to pass to a machine learning model, i.e. object detection using the TensorFlow Object Detection API. The response is simple text listing the detected objects.
I need a GPU-based cloud instance where I can deploy this application for fast processing to achieve real-time results. I have searched a lot but found no such resource. I believe Google Cloud (Compute Engine instances) can be connected to a live API, but I am not sure how exactly.
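For reference, a simplified sketch of what the view does (the names here are illustrative, not my exact code, and detect_objects stands in for the TensorFlow model call):

```python
# Simplified Django view: decode the base64 image, convert it to a NumPy
# array, run the detector, and return the detected labels as plain text.
import base64
import io

import numpy as np
from PIL import Image
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def detect(request):
    image_b64 = request.POST["image"]
    image = Image.open(io.BytesIO(base64.b64decode(image_b64)))
    frame = np.array(image)           # HxWx3 uint8 array
    labels = detect_objects(frame)    # placeholder for the object-detection call
    return HttpResponse(", ".join(labels), content_type="text/plain")
```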
Thanks
I assume you're using a GPU locally or wherever your Django application is hosted.
The first thing is to make sure you are using tensorflow-gpu and that all the necessary CUDA setup is done.
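A quick way to verify that (assuming TensorFlow 2.x; on 1.x, tf.test.is_gpu_available() is the equivalent check):

```python
# Sanity check that this TensorFlow build was compiled with CUDA and can
# actually see a GPU on the instance.
import tensorflow as tf

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```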
You can start your GPU instance easily on Google Cloud Platform (GCP). There are multiple ways to do this.
Quick option
Search for Notebooks and start a new instance with the required GPU and RAM.
Instead of a notebook instance, you can set up the instance separately if you need a specific OS and more flexibility in choosing the machine.
To access the instance with SSH, simply add your SSH public key to Metadata, which can be seen when you open the instance details.
Set up Django as you would on any server. To test it, simply debug-run it on host 0 or 0.0.0.0 and your preferred port.
You can access the APIs via the external IP of the machine, which can be found on the instance details page.
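Once it's running, a quick smoke test from any machine could look like the following; the external IP, port, route, and form field are placeholders, assuming the Django view accepts base64 images as described in the question.

```python
# Smoke-test the deployed API. IP, port, route, and field name are placeholders.
import base64
import requests

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post("http://EXTERNAL_IP:8000/detect/", data={"image": image_b64})
print(resp.status_code, resp.text)
```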
Some suggestions
While the first option is quick and dirty, it's not recommended to use that in production.
It is better to use a deployment service such as TensorFlow Serving along with Kubeflow.
If you think you're handling the inference itself properly, then make sure you load balance the server properly. Use NGINX or any other good server along with gunicorn/uWSGI.
You can use Redis for queue management. When someone calls the API, a GPU is not necessarily free for inference at that moment. It is fine to skip this when you have a very low number of hits on the API per second, but when scaling up, say to 50 requests per second, which a single GPU can't handle at once, a queue system helps.
All requests should go to Redis first, and the GPU worker takes the jobs to be done from the queue. If required, you can always scale the GPUs.
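A minimal sketch of that pattern with the redis-py client; the queue name, payload format, and run_inference are arbitrary choices for illustration, not a prescribed setup.

```python
# Minimal Redis-backed job queue: the API process pushes jobs, a separate
# GPU worker process pops them and runs inference. Names are placeholders.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# API side: enqueue the request instead of hitting the GPU directly.
def enqueue_job(job_id, image_b64):
    r.rpush("inference:jobs", json.dumps({"id": job_id, "image": image_b64}))

# GPU worker side: block until a job arrives, then process it.
def worker_loop():
    while True:
        _, raw = r.blpop("inference:jobs")    # blocks until a job is available
        job = json.loads(raw)
        result = run_inference(job["image"])  # placeholder for the model call
        r.set("inference:result:%s" % job["id"], json.dumps(result))
```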
Google Cloud actually offers Cloud GPUs. If you are looking to perform higher-level computations with applications that require real-time capabilities, I would suggest you look into the following link for more information.
https://cloud.google.com/gpu/
Compute Engine also provides GPUs that can be added to your virtual machine instances. Use GPUs to accelerate specific workloads on your instances such as Machine Learning and data processing.
https://cloud.google.com/compute/docs/gpus/
However, if your application requires a lot of resources, you'll need to ensure you have enough GPUs available in your project, and make sure to pick a zone where GPUs are available. If this requires much more computing power than your current quota allows, you will need to submit a request for a quota increase. https://cloud.google.com/compute/docs/gpus/add-gpus#create-new-gpu-instance
Since you would be using the TensorFlow API for your application on ML Engine, I would advise you to take a look at the link below. It provides instructions for creating a Deep Learning VM instance with TensorFlow and other tools pre-installed.
https://cloud.google.com/ai-platform/deep-learning-vm/docs/tensorflow_start_instance
What is the fastest expected response time of the Google Speech API with streaming audio data? I am sending an audio stream to the API and am receiving the interim results with a 2000 ms delay, which I was hoping to drop to below 1000 ms. I have tested different sampling rates and different voice models.
I'm afraid that response time can't be guaranteed, given the nature of the service: we don't know what is done under the hood, and in fact there is no SLA for response time, even though there is an SLA for availability.
Something that can help you is working on building a good request:
Reducing the frame size of the audio chunks you stream, for example to 100 milliseconds, offers a good tradeoff between latency and efficiency.
Following the Best Practices will help you make a clean request so that latency can be reduced.
You may want to check the following links on specific use cases to see how they addressed latency issues:
Realtime audio streaming to Google Speech engine
How to speed up google cloud speech
25s Latency in Google Speech to Text
If you really care about response time, you'd be better off using a Kaldi-based service on your own infrastructure. Something like https://github.com/alumae/kaldi-gstreamer-server together with https://github.com/Kaljurand/dictate.js
Google Cloud Speech itself works pretty fast; you can check how quickly your microphone input gets transcribed at https://cloud.google.com/speech-to-text/.
You may be experiencing a buffering issue on your side: the tool you are using may buffer data before sending it (buffer flush) to the underlying device (stream).
You can find out how to decrease the output buffer of that tool to lower values, e.g. 2 KB, so data will reach the Node app and the Google service faster. Google recommends sending data in chunks equal to a 100 ms buffer size.
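For illustration, here is a minimal sketch of streaming roughly 100 ms chunks with the current google-cloud-speech Python client (the Node.js client has an equivalent streaming interface); the encoding, sample rate, and file-based audio source are assumptions, not taken from the question.

```python
# Sketch: stream ~100 ms chunks of 16 kHz LINEAR16 audio and print interim
# results as they arrive. audio_chunks() stands in for your real audio source.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
streaming_config = speech.StreamingRecognitionConfig(config=config, interim_results=True)

def audio_chunks():
    # 100 ms of 16-bit mono audio at 16 kHz = 16000 * 2 * 0.1 = 3200 bytes per chunk
    with open("audio.raw", "rb") as f:
        while True:
            chunk = f.read(3200)
            if not chunk:
                break
            yield speech.StreamingRecognizeRequest(audio_content=chunk)

for response in client.streaming_recognize(streaming_config, audio_chunks()):
    for result in response.results:
        print(result.is_final, result.alternatives[0].transcript)
```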
How should we architect a solution that uses Amazon Mechanical Turk API to process a stream of tasks instead of a single batch of bulk tasks?
Here's more info:
Our app receives a stream of about 1,000 photos and videos per day. Each picture or video contains 6-8 numbers (it's the serial number of an electronic device) that need to be transcribed, along with a "certainty level" for the transcription (e.g. "Certain", "Uncertain", "Can't Read"). The transcription will take under 10 seconds per image and under 20 seconds per video and will require minimal skill or training.
Our app will get uploads of these images continuously throughout the day and we want to turn them into numbers within a few minutes. The ideal solution would be for us to upload new tasks every minute (under 20 per minute during peak periods) and download results every minute too.
Two questions:
To ensure a good balance of fast turnaround time, accuracy, and cost effectiveness, should we submit one task at a time, or is it best to batch tasks? If so, what variables should we consider when setting a batch size?
Are there libraries or hosted services that wrap the MTurk API to more easily handle use-cases like ours where HIT generation is streaming and ongoing rather than one-time?
Apologies for the newbie questions, we're new to Mechanical Turk.
Streaming tasks one at a time to Turk
You can stream tasks individually through Mechanical Turk's API by using the CreateHIT operation. Every time you receive an image in your app, you can call the CreateHIT operation to immediately send the task to Turk.
You can also set up notifications through the API, so you can be alerted as soon as a task is completed. Turk Notification API Docs
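For illustration only, this is roughly what a per-image HIT creation looks like with Python's boto3 MTurk client (the Ruby rturk calls mentioned below are analogous); the question XML, reward, and timing values are placeholders, not recommendations.

```python
# Sketch: create one HIT per uploaded image as it arrives. Reward, timings,
# and the ExternalQuestion URL are placeholder values.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",  # sandbox for testing
)

question_xml = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/transcribe?image_id=12345</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Transcribe the serial number in this image",
    Description="Type the 6-8 digit serial number and your certainty level",
    Keywords="transcription, image, serial number",
    Reward="0.05",
    MaxAssignments=1,
    AssignmentDurationInSeconds=120,
    LifetimeInSeconds=3600,
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```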
Batching vs Streaming
As for batching vs streaming, you're better off streaming to achieve a good balance of turnaround time and cost. Batching won't drive down costs much, and improving accuracy largely depends on vetting, reviewing, and tracking worker performance, either manually or through automated processes.
Libraries and Services
Most libraries offer all of the operations available in the API, so you can just Google or search GitHub for a library in your programming language. (We use the Ruby library rturk.)
A good list of companies that offer hosted solutions can be found under the Metaplatforms section of an answer on Quora to the question: What are some crowdsourcing services similar to Amazon Mechanical Turk? (Disclaimer: my company, Houdini, is one of the solutions listed there.)