DetectIidSpike() with streaming data - ml.net

Most of the examples of DetectIidSpike load data from a file or an array in memory and train and run the model using this data.
What about loading a saved model (created with the previous approach) and applying it to individual incoming data points from a data stream, so that the anomalies can be evaluated on each one? Is that possible? How can it be done? Any examples or pointers? Thanks!

Related

How do I apply my model to a new dataset in WEKA?

I have created a new prediction model based on a dataset that was given to me. It predicts a nominal (binary) class attribute (positive/negative) based on a number of numerical attributes.
Now I have been asked to use this prediction model to predict classes for a new dataset. This dataset has all the same attributes except for the class column, which does not exist yet. How do I apply my model to this new data? I have tried adding an empty class column to my new dataset and then doing the following:
Simply loading the new dataset in WEKA's explorer and loading the model. It tells me there is no training data.
Opening my training set in WEKA's explorer and then opening my training model, then choosing my new data as a 'supplied test set'. It runs but does not output any predictions.
I should note that the model works fine when testing on the training data with cross-validation. It also works fine with a subset of the training data I separated ages ago for test/eval use. So I think the problem may be with how I am adding the new class column?
For making predictions, Weka requires the two datasets, the training set and the one to make predictions on, to have exactly the same structure, down to the order of the labels. That also means that you need to have a class attribute with the correct labels present. For the values of your class attribute, simply use the missing value (denoted by a question mark).
See the FAQ "How do I make predictions with a trained model?" on the Weka wiki for more information on how to make predictions.
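As a rough illustration, the prediction file could look like the ARFF sketch below (the attribute names here are made up): the class attribute is declared with the same labels as in training, and every class value is the missing-value marker ?.

    @relation new_data
    @attribute attr1 numeric
    @attribute attr2 numeric
    @attribute class {positive,negative}
    @data
    1.2,3.4,?
    0.7,9.1,?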

Making predictions in AutoML Natural Language UI

I've been testing the Natural Language UI functionality and created a model for Single-Label classification. To train the model I used a CSV with two columns: the first column has the text and the second has the label.
Then I go to the "Test & Use" tab to perform predictions. I upload a CSV file to Cloud Storage (GS), and when I try to select it I get the message "Invalid file type, only following file types allowed: pdf, tif, tiff".
I was wondering whether I can use a csv file similar to when I trained the model.
As per the documentation, you can't do batch prediction from the UI.
You need to use the API's batchPredict method to do so.
If you would like to use your model to do high-throughput asynchronous prediction on a corpus of documents you can use the batchPredict method. The batch prediction methods require you to specify input and output URIs that point to locations in Cloud Storage buckets.
The input URI points to a CSV or JSONL file, which specifies the content to analyze
Example:
gs://folder/text1.txt
gs://folder/text2.pdf
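For reference, a batch prediction call with the Python client library could look roughly like the sketch below. This is only an outline: the project ID, model ID and gs:// paths are placeholders, and the exact client surface may differ between google-cloud-automl versions.

    # Sketch of calling the batchPredict method via the google-cloud-automl
    # Python client. All IDs and URIs below are placeholders.
    from google.cloud import automl

    prediction_client = automl.PredictionServiceClient()

    # Full resource name of the trained model (placeholder project/model IDs).
    model_name = automl.AutoMlClient.model_path(
        "my-project", "us-central1", "TCN1234567890")

    # Input: a CSV or JSONL file in Cloud Storage listing the texts to classify.
    input_config = automl.BatchPredictInputConfig(
        gcs_source=automl.GcsSource(input_uris=["gs://my-bucket/batch_input.csv"]))

    # Output: a Cloud Storage prefix where the result files will be written.
    output_config = automl.BatchPredictOutputConfig(
        gcs_destination=automl.GcsDestination(
            output_uri_prefix="gs://my-bucket/results/"))

    # batch_predict is asynchronous; .result() waits for the operation to finish.
    operation = prediction_client.batch_predict(
        name=model_name, input_config=input_config, output_config=output_config)
    operation.result()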

Does Google BigQuery ML automatically make the time series data stationary?

So I am a newbie to Google BigQuery ML and was wondering whether auto.ARIMA automatically makes my time series data stationary?
Suppose I have data that is not stationary: if I give the data as-is to the auto.ARIMA model in Google BigQuery ML, will it first make my data stationary before taking it as input?
Yes, but that's part of the modeling procedure.
From the documentation that explains what's inside a BigQuery ML time series model, it does appear that auto.ARIMA will make the data stationary.
However, I would not expect it to alter the source data table; it won't make that stationary. Rather, in the course of building the candidate models it may transform the input data prior to the actual model fitting (e.g. Box-Cox, differencing to make the series stationary, etc.).
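As a side note on what "making the data stationary" usually means in ARIMA terms: the I/d part of the model is essentially differencing. The small pandas sketch below (with made-up numbers) only illustrates that transform; it is not what BigQuery ML does internally.

    # Illustration of differencing, the usual way ARIMA-style models
    # deal with a non-stationary (here: trending) series. Data is made up.
    import pandas as pd

    series = pd.Series([10, 12, 15, 19, 24, 30])  # trending, not stationary

    diff1 = series.diff().dropna()   # first difference: y[t] - y[t-1]
    print(diff1.tolist())            # [2.0, 3.0, 4.0, 5.0, 6.0] -> still trending

    diff2 = diff1.diff().dropna()    # second difference
    print(diff2.tolist())            # [1.0, 1.0, 1.0, 1.0] -> roughly constant, so d = 2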

Model / View: Filtering data beforehand in database or at runtime in proxy model?

Imagine an application that displays data from an SQLite database.
The app is making use of model/view programming.
It can have multiple views acting in parallel on different subsets of the same data (subsets made by filtering the required data types).
(Sidenote: I am using Qt, so there is no controller part, of course, but I did not find a more suitable tag.)
I am not sure which approach to take:
1a. Load all of the database data into one single model.
1b. Then set that model on all views, filtering the data inside each view with a proxy model (see the sketch below).
2a. One model for each view, with the filtering done inside the SQLite database.
Pros/Cons:
Idea 1:
(+) one model, makes use of model/view advantages (e.g. updating all connected views)
(-) memory usage could get huge because all data is loaded into a model, but only a subset is shown
Idea 2:
(+) theoretically lower memory usage because only the filtered data is loaded from the database
(-) the views can have filters that could lead to intersecting data, meaning the same data would be stored in more than one model -> perhaps practically even bigger memory usage than in Idea 1
The data being loaded here is just case metadata, e.g. title, description, datetime and so on. Bigger data like images and files is not loaded here. So although the database itself could indeed grow big (big for this kind of application, say 200 GB for power users), that does not affect the question at hand, because the metadata is much, much smaller and grows with the number of records, not with the overall data size.
Do you have practical experience with such a configuration and can suggest which one to use? It seems to me that Idea 1 is the way to go, but I am not sure about it.
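To make Idea 1 a bit more concrete, here is a minimal sketch of how it could look with PySide6 (the database file, table name and filter column are made up; each additional view would simply get its own proxy on the same source model):

    # Sketch of Idea 1: one shared source model, one filtering proxy per view.
    # Assumes PySide6; file name, table name and column index are placeholders.
    from PySide6.QtCore import QSortFilterProxyModel
    from PySide6.QtSql import QSqlDatabase, QSqlTableModel
    from PySide6.QtWidgets import QApplication, QTableView

    app = QApplication([])

    db = QSqlDatabase.addDatabase("QSQLITE")
    db.setDatabaseName("cases.db")        # placeholder database file
    db.open()

    # One model holding all of the case metadata.
    source_model = QSqlTableModel()
    source_model.setTable("cases")        # placeholder table name
    source_model.select()

    # Each view filters the shared model through its own proxy.
    proxy = QSortFilterProxyModel()
    proxy.setSourceModel(source_model)
    proxy.setFilterKeyColumn(2)           # e.g. a "type" column
    proxy.setFilterFixedString("incident")

    view = QTableView()
    view.setModel(proxy)
    view.show()
    app.exec()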
In my experience, the less data is loaded from the database into memory, the better. It is not just the memory usage, but also startup time. If the data is delivered over the network, loading a few gigabytes can take forever.
So I would go for a variant of your second solution, where each table view has its own model. The model is an implementation of QAbstractItemModel that lazily fetches only the rows that currently need to be displayed. The models could, however, share a common cache. This will also make sure that they all display the same data where it intersects.
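A minimal sketch of such a lazily fetching model is shown below (assuming PySide6 and a made-up "cases" table; the per-view filter is pushed into the SQL query, and a shared cache could be layered on top):

    # Sketch of a QAbstractTableModel that pulls rows from SQLite in small
    # batches via canFetchMore()/fetchMore(). Table and columns are made up.
    import sqlite3
    from PySide6.QtCore import QAbstractTableModel, QModelIndex, Qt

    class LazyCaseModel(QAbstractTableModel):
        BATCH = 100  # rows fetched per step

        def __init__(self, db_path, where_clause="1=1", parent=None):
            super().__init__(parent)
            self._conn = sqlite3.connect(db_path)
            self._where = where_clause   # per-view filter, e.g. "type = 'incident'"
            self._rows = []              # rows fetched so far
            self._total = self._conn.execute(
                "SELECT COUNT(*) FROM cases WHERE " + self._where).fetchone()[0]

        def rowCount(self, parent=QModelIndex()):
            return len(self._rows)

        def columnCount(self, parent=QModelIndex()):
            return 3  # title, description, datetime

        def data(self, index, role=Qt.DisplayRole):
            if index.isValid() and role == Qt.DisplayRole:
                return self._rows[index.row()][index.column()]
            return None

        def canFetchMore(self, parent=QModelIndex()):
            return len(self._rows) < self._total

        def fetchMore(self, parent=QModelIndex()):
            start = len(self._rows)
            batch = self._conn.execute(
                "SELECT title, description, datetime FROM cases WHERE "
                + self._where + " LIMIT ? OFFSET ?",
                (self.BATCH, start)).fetchall()
            if not batch:
                return
            self.beginInsertRows(QModelIndex(), start, start + len(batch) - 1)
            self._rows.extend(batch)
            self.endInsertRows()

Each view would then instantiate its own LazyCaseModel with its specific filter.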

How are documents retrieved after reduce produces the output?

So, after reduce completes its job, we have the resulting data stored in files.
But what happens when the user types something? How is search performed when the data is stored just in files?
MapReduce is for processing. So once you have processed the data and generated your aggregate information, which sits on HDFS, you will either have to read the files in some program to display the results to the user, or use one of several alternative options for reading the data from HDFS:
You could use Hive, create a table on top of this data, and read it using SQL-like queries. A simple web application can connect to this through the Thrift server, which provides a JDBC interface to Hive.
Other options include loading the data into HBase, Shark, etc. It all depends on your use case in terms of the size of the aggregated data and the performance requirements.
What you have constructed after MapReduce is an inverted index, a nice little data structure. Now you have to use it.
For example, in Google's case, this inverted index is sharded across many servers, and each server stores the entire posting list for the terms it is responsible for. So, for example, server 500 has the list for "be", and another has the list for "to". These are implementation details; you could theoretically store it all on one box in a large hash table if you could hold the index in memory.
When the customer types words into the engine, it retrieves the entire list for each word. If there are multiple words, it intersects those lists to show you the documents that contain all of the query words.
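In code terms, the lookup boils down to something like the toy Python sketch below (the index contents and the query are made up):

    # Toy sketch of querying an inverted index: fetch each word's posting
    # list (set of document IDs) and intersect them for multi-word queries.
    inverted_index = {
        "to": {1, 2, 5, 7},
        "be": {2, 3, 5, 9},
        "or": {2, 4, 5},
    }

    def search(query, index):
        words = query.lower().split()
        postings = [index.get(word, set()) for word in words]
        if not postings:
            return set()
        # Keep only documents that contain every query word.
        return set.intersection(*postings)

    print(search("to be", inverted_index))  # -> {2, 5}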
Here is the full paper on how they did it: http://infolab.stanford.edu/~backrub/google.html
See "Figure 4. Google Query Evaluation"