What is Google Cloud's anomaly detection solution for time-series streaming data, similar to AWS's Kinesis Random Cut Forest algorithm?

I'm trying to implement an anomaly detection machine learning solution on GCP, but I'm finding it hard to identify a specific offering in Google Cloud ML comparable to AWS's Random Cut Forest solution in Kinesis. I'm streaming IoT temperature sensor data from water heaters.
Does anyone know of a TensorFlow/Google solution for this, since my company only uses the Google stack?
I've tried using scikit-learn models, but none of them can be deployed in production for streaming data, so I have to use TensorFlow, and I'm a novice with it. Any suggestions on a good workflow to get this done?

I would suggest using the Esper complex event processing engine if your primary concern is analyzing the data stream and catching patterns in real time. It provides an SQL-like event processing language that runs as a continuous query over streaming data. Esper offers abstractions for correlation, aggregation, and pattern detection. It is an open-source project, but a license is required if you want to run the engine on multiple servers to achieve high availability.
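Esper itself is a Java engine driven by EPL statements, so the snippet below is only a rough Python illustration of the kind of sliding-window aggregation such a continuous query expresses, not Esper itself; the window size and the 3-sigma threshold are made-up values you would tune for your sensors.

```python
# Rough Python illustration of a sliding-window aggregation rule, analogous to
# what an EPL continuous query would express. Window size and 3-sigma threshold
# are placeholder values.
from collections import deque
from statistics import mean, stdev

WINDOW_SIZE = 120  # e.g. the last ~2 minutes at one reading per second
window = deque(maxlen=WINDOW_SIZE)

def check_reading(sensor_id: str, temperature: float) -> bool:
    """Flag a reading that deviates more than 3 sigma from the recent window."""
    is_anomaly = False
    if len(window) == WINDOW_SIZE:
        mu, sigma = mean(window), stdev(window)
        if sigma > 0 and abs(temperature - mu) > 3 * sigma:
            is_anomaly = True
            print(f"Anomaly on {sensor_id}: {temperature:.1f} (window mean {mu:.1f})")
    window.append(temperature)
    return is_anomaly

# check_reading("heater-42", value) would be called for each message
# consumed from the stream.
```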

Related

Data Quality Framework in AWS

I am trying to implement a data quality framework for an application that ingests data from various systems (batch, near real time, real time). A few items that I want to highlight here are:
The data pipelines vary widely and ingest very high volumes of data. They are developed using Spark, Python, EMR clusters, Kafka, and Kinesis streams.
Any new system that we onboard into the framework should be able to include data quality checks with minimal coding, so some sort of metadata framework might help, e.g. storing the business rules in DynamoDB so checks can run automatically against different feeders or newly created data pipelines.
Our tech stack includes AWS, Python, Spark, and Java, so kindly advise related services (AWS DataBrew, PyDeequ, the Great Expectations library, and various Lambda event-driven services are some I want to focus on).
I am also looking for some sort of audit, balance, and control mechanism: auditing the source data, balancing the number of records between two points, and having some automated mechanism to remediate (control) discrepancies.
I am looking for testing frameworks for the different data pipelines.
Also, for data profiling, kindly advise tools/libraries; AWS DataBrew and Pandas are some I am exploring.
I know there won't be one specific solution, and hence I appreciate any and all ideas. A flow diagram covering audit, balance, and control, with an automated data validation and testing mechanism for data pipelines, would be very helpful.
Thanks!!!
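For reference, a single rule-driven quality check with PyDeequ on Spark might look roughly like the sketch below. The file path, column names, and specific rules are placeholders; in the metadata-driven framework described above, the rules themselves would be loaded from the metadata store (e.g. DynamoDB) rather than hard-coded.

```python
# Illustrative PyDeequ check on a Spark DataFrame; path, columns, and rules are
# placeholders standing in for rules loaded from a metadata store.
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://my-bucket/feed/")  # placeholder input location

check = (Check(spark, CheckLevel.Error, "feed quality")
         .isComplete("record_id")        # no nulls in the key column
         .isUnique("record_id")          # no duplicate records
         .hasSize(lambda n: n > 0))      # at least one record landed

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

# Persist or alert on the outcome (e.g. write to an audit table, raise an alert).
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```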

Google AutoML Video Intelligence Tools?

I'm using AutoML Video Intelligence and it's very tedious, so I was wondering if there is an easier way to create datasets for object tracking. Is there an easy way to get the time and position of the bounding box?
I'm pretty sure you can find answers to these questions by reading the GCP documentation, in particular for the AutoML Video Intelligence product.
At the very least, the object tracking process is nicely explained in terms of implementation, either through the GCP console UI or by constructing HTTP calls to the Cloud AutoML REST API.
Furthermore, you can find an example showing how to handle video segment positioning for the relevant prediction requests.
You could also edit your original question to add specific details about your use case, so the solution can be addressed more precisely.

Google Cloud AutoML Natural Language for Chatbot like application

I want to develop a chatbot like application which gives response to input questions using Google Cloud Platform.
Naturally, Dialogflow is suited for such applications, but due to business conditions I cannot use Dialogflow.
An alternative could be AutoML Natural Language, where I do not need much machine learning expertise.
AutoML Natural Language requires documents which are labelled. These documents can be used for training a model.
My example document:
What is cost of Swiss tour?
Estimate of Switzerland tour?
I would use a label such as Switzerland_Cost for this document.
Now, in my application I would have a mapping between Labels and Responses.
During Prediction, when I give an input question to the trained model, I would get a predicted label. I can then use this label to return the mapped response.
Is there a better approach to my scenario?
I'm from the AutoML team. This seems like a good approach to me. People use AutoML NL for intent detection, which is pretty well aligned with what you are trying to do here.
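For reference, the label-to-response lookup around an AutoML NL prediction might look roughly like this sketch; PROJECT_ID, MODEL_ID, and the RESPONSES mapping are placeholders, and the model is assumed to be already trained and deployed.

```python
# Minimal sketch of the label -> response flow around an AutoML Natural
# Language text classification prediction. IDs and responses are placeholders.
from google.cloud import automl

PROJECT_ID = "my-project"        # placeholder
MODEL_ID = "TCN1234567890"       # placeholder AutoML NL model ID

RESPONSES = {
    "Switzerland_Cost": "A Swiss tour typically costs ...",
    # one entry per trained label
}

def predict_label(question: str) -> str:
    """Return the highest-scoring label predicted for the question."""
    client = automl.PredictionServiceClient()
    model_path = automl.AutoMlClient.model_path(PROJECT_ID, "us-central1", MODEL_ID)
    payload = {"text_snippet": {"content": question, "mime_type": "text/plain"}}
    response = client.predict(name=model_path, payload=payload)
    best = max(response.payload, key=lambda p: p.classification.score)
    return best.display_name

def answer(question: str) -> str:
    label = predict_label(question)
    return RESPONSES.get(label, "Sorry, I don't have an answer for that yet.")

print(answer("What is cost of Swiss tour?"))
```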

AWS SageMaker for detecting text in an image

I am aware that it is better to use AWS Rekognition for this. However, it does not seem to work well when I tried it out with the images I have (which are sort of like small containers with labels on them). The text comes out misspelled and fragmented.
I am new to ML and SageMaker. From what I have seen, the use cases seem to be for prediction and image classification. I could not find one on training a model for detecting text in an image. Is it possible to do it with SageMaker? I would appreciate it if someone pointed me in the right direction.
The different services all provide different levels of abstraction for Optical Character Recognition (OCR), depending on which parts of the pipeline you are most comfortable working with and what you prefer to have abstracted.
Here are a few options:
Rekognition will provide out of the box OCR with the DetectText feature. However, it seems you will need to perform some sort of pre-processing on your images in your current case in order to get better results. This can be done through any method of your choice (Lambda, EC2, etc).
SageMaker is a tool that will enable you to easily train and deploy your own models (of any type). You have two primary options with SageMaker:
Do-it-yourself option: If you're looking to go the route of labeling your own data, gathering a sizable training set, and training your own OCR model, this is possible by training and deploying your own model via SageMaker.
Existing OCR algorithm: There are many algorithms out there that all have different potential tradeoffs for OCR. One example would be Tesseract. Using this, you can more closely couple your pre-processing step to the text detection.
Amazon Textract (In preview) is a purpose-built dedicated OCR service that may offer better performance depending on what your images look like and the settings you choose.
I would personally recommend looking into pre-processing for OCR to see if it improves Rekognition accuracy before moving onto the other options. Even if it doesn't improve Rekognition's accuracy, it will still be valuable for most of the other options!
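As a starting point for that pre-processing, something like the sketch below is one way to clean up the images before calling Rekognition's DetectText API; the resize factor is a guess to tune against your label images, and "labels.jpg" is a placeholder file name.

```python
# Rough sketch: light pre-processing with Pillow, then Rekognition DetectText.
import io

import boto3
from PIL import Image, ImageOps

def preprocess(path: str) -> bytes:
    """Grayscale, auto-contrast, and upscale the image to help OCR."""
    img = Image.open(path).convert("L")                   # grayscale
    img = ImageOps.autocontrast(img)                      # stretch contrast
    img = img.resize((img.width * 2, img.height * 2))     # upscale small text
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

rekognition = boto3.client("rekognition")
response = rekognition.detect_text(Image={"Bytes": preprocess("labels.jpg")})

# Print detected lines with their confidence scores.
for detection in response["TextDetections"]:
    if detection["Type"] == "LINE":
        print(detection["DetectedText"], round(detection["Confidence"], 1))
```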

How to make my datalab machine learning run faster

I have some data: 3.2 million entries in a CSV file. I'm trying to use a CNN estimator in TensorFlow to train the model, but it's very slow. Every time I run the script, it gets stuck, and the web page (localhost) just refuses to respond anymore. Any recommendations? (I've tried with 22 CPUs and I can't increase that any further.)
Can I just run it in the background, something like python xxx.py & on the command line, to keep the process going, and then come back to check on it after some time?
Google offers serverless machine learning with TensorFlow for precisely this reason. It is called Cloud ML Engine. Your workflow would basically look like this:
Develop the program to train your neural network on a small dataset that can fit in memory (iron out the bugs, make sure it works the way you want)
Upload your full data set to the cloud (Google Cloud Storage or BigQuery or &c.) (documentation reference: training steps)
Submit a package containing your training program to ML Cloud (this will point to the location of your full data set in the cloud) (documentation reference: packaging the trainer)
Start a training job in the cloud; this is serverless, so it will take care of scaling to as many machines as necessary, without you having to deal with setting up a cluster, &c. (documentation reference: submitting training jobs).
You can use this workflow to train neural networks on massive data sets - particularly useful for image recognition.
If this is a little too much information, or if this is part of a workflow that you'll be doing a lot and you want to get a stronger handle on it, Coursera offers a course on Serverless Machine Learning with Tensorflow. (I have taken it, and was really impressed with the quality of the Google Cloud offerings on Coursera.)
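As a minimal illustration of the packaging step above: the training code is usually laid out as a small Python package (e.g. a trainer/ directory with __init__.py and task.py) plus a setup.py at the root. The sketch below is illustrative only; the package name and dependency are placeholders.

```python
# setup.py -- minimal, illustrative packaging for the trainer package; assumes
# the training code lives in a trainer/ directory with __init__.py and task.py.
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=["tensorflow"],   # pin the version you actually train with
    description="CNN trainer package submitted for managed cloud training",
)
```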
I'm sorry for answering even though I am completely ignorant of what Datalab is, but have you tried batching?
I am not sure whether it is possible in this scenario, but could you feed in maybe only 10,000 entries at a time, and repeat this over enough batches that eventually all entries have been processed?
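If the bottleneck really is trying to hold the whole CSV in memory, batching along these lines is possible with tf.data; the sketch below is illustrative, with the file path, label column, and batch size as placeholders.

```python
# Illustrative batching of the CSV with tf.data, so the full 3.2M rows never
# have to sit in memory at once.
import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    "data.csv",          # placeholder path to the 3.2M-row file
    batch_size=10_000,   # roughly the "10,000 entries in one go" suggested above
    label_name="label",  # placeholder label column
    num_epochs=1,
    shuffle=True,
)

# Peek at one batch to confirm shapes before wiring it into the estimator/model.
for features, labels in dataset.take(1):
    print({name: tensor.shape for name, tensor in features.items()}, labels.shape)
```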