Anomaly detection in production - google-cloud-platform

I am trying to search for suggestions and solutions, but I am unable to find any.
After reading blogs, I am able to build a time series anomaly detection using BigQuery ML (Arima Plus).
My question is: how do I put such a model in production?
Probably I need to:
program the re-training of the model every X days
check whether there are new anomalies on the object table every X hours
record those anomalies in another table
But I also accept other suggestion on how to proceed.
Is there anyone out there that can give me any hint?
Thank you!

The best way I found is to create "scheduled queries":
schedule a query for re-training of the model every X days:
CREATE OR REPLACE MODEL mymodel
OPTIONS( model_type='arima_plus',
TIME_SERIES_DATA_COL='events',
TIME_SERIES_TIMESTAMP_COL='approx_hour',
HOLIDAY_REGION = 'GLOBAL',
CLEAN_SPIKES_AND_DIPS = FALSE,
DECOMPOSE_TIME_SERIES=TRUE)
AS (SELECT 
TIMESTAMP_TRUNC( PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ',start_time), hour) as approx_hour, 
COUNT(1) AS events 
FROM  `mytable`
GROUP BY approx_hour);
 
schedule a query to perform anomaly detection on the latest events, and eventually write them on a table:
insert into `events_anomalies_table`
SELECT approx_hour as hour,
cast(events as int64) as actual_events,
cast(lower_bound as int64) as expected_min_events,
cast(upper_bound as int64) as expected_max_events,
current_timestamp() as execution_timestamp
FROM ML.DETECT_ANOMALIES(
MODEL`my_model`,
STRUCT (0.98 AS anomaly_prob_threshold),
( SELECT
TIMESTAMP_TRUNC( PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ',start_time), hour) as approx_hour, 
COUNT(1) AS events 
FROM  `my_table`
WHERE PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ',start_time) > TIMESTAMP_SUB(CURRENT_TIMESTAMP() , INTERVAL 1 HOUR)
GROUP BY approx_hour
LIMIT 1))
WHERE is_anomaly = True

Related

Why Amazon Forecast cannot train the predictor?

While training my predictor I came across this error and I got stuck how to fix it.
I have two data-series, a "Target time-series data" with 9234 rows and a single "item_id" and a second one that is "Related time-series data" with the same number of rows as I only have a single id.
I'm setting de data with a window of 180 days, what is exactly the difference between the second and the first number that has appeared on the error, 9414 - 9234 = 180.
We were unable to train your predictor.
Please ensure there are no missing values for any items in the related time series, All items need data until 2020-03-15 00:00:00.0. For example, following items have missing data: item: brl only has 9234/9414 required datapoints starting 1994-06-07 00:00:00.0, please refer to documentation for additional details.
Once my data don't have missing data and it's on a daily basis why is it returning this error?
My data starts on 1994-06-07 and ends on 2019-09-17. Why should I have 9414 data points rather than 9234?
Should I take out 180 days in my "Target time-series data"?
The future values of the related time-series data must be known.
Example of a good related-time series: You know past and future days in which marketing has or will send email newsletters promoting the product you're forecasting. You can use this data as a related-time series.
Example of a bad related-time series: You notice that Google searches for your brand correlated with the sale of your product. As a result you want to use it as a related-time series. Since you don't know how many searches will occur in the future, so you can't use this as a related time series.
In you case, You have TARGET_TIME_SERIES data for 9414 days and you want to predict demand for the next 180 days. That means your RELATED_TIME_SERIES data should be 9594 days.
Edit: I have not tested this with amazon's forecasting product. I'm basing my answer on working with Facebook Prophet (which is one of the models amazon forcast uses). Please let me know if my solution worked.

Google AutoML Importing text items very slow

I'm importing text items to Google's AutoML. Each row contains around 5000 characters and I'm adding 70K of these rows. This is a multi-label data set. There is no progress bar or indication of how long this process will take. Its been running for a couple of hours. Is there any way to calculate time remaining or total estimated time. I'd like to add additional data sets, but I'm worried that this will be a very long process before the training even begins. Any sort of formula to create even a semi-wild guess would be great.
-Thanks!
I don't think that's possible today, but I filed a feature request [1] that you can follow for updates. I asked for both training and importing data, as for training it could be useful too.
I tried training with 50K records (~ 300 bytes/record) and the load took more than 20 mins after which I killed it. I retried with 1K, which ran for 20 mins and then emailed me an error message saying I had multiple labels per input (yes, so what? training data is going to have some of those) and I had >100 labels. I simplified the classification buckets and re-ran. It took another 20 mins and was successful. Then I ran 'training' which took 3 hours and billed me $11. That maps to $550 for 50K recs, assuming linear behavior. The prediction results were not bad for a first pass, but I got the feeling that it is throwing a super large neural net at the problem. Would help if they said what NN it was and its dimensions. They do say "beta" :)
don't wast your time trying to using google for text classification. I am a GCP hard user but microsoft LUIS is far better, precise and so much faster that I can't believe that both products are trying to solve same problem.
Luis has a much better documentation, support more languages, has a much better test interface, way faster.. I don't know if is cheaper yet because the pricing model is different but we are willing to pay more.

AWS Machine Learning Data

I'm using the AWS Machine Learning regression to predict the waiting time in a line of a restaurant, in a specific weekday/time.
Today I have around 800k data.
Example Data:
restaurantID (rowID)weekDay (categorical)time (categorical)tablePeople (numeric)waitingTime (numeric - target)1 sun 21:29 2 23
2 fri 20:13 4 43
...
I have two questions:
1)
Should I use time as Categorical or Numeric?
It's better to split into two fields: minutes and seconds?
2)
I would like in the same model to get the predictions for all my restaurants.
Example:
I expected to send the rowID identifier and it returns different predictions, based on each restaurant data (ignoring others data).
I tried, but it's returning the same prediction for any rowID. Why?
Should I have a model for each restaurant?
There are several problems with the way you set-up your model
1) Time in the form you have it should never be categorical. Your model treats times 12:29 and 12:30 as two completely independent attributes. So it will never use facts it learn about 12:29 to predict what's going to happen at 12:30. In your case you either should set time to be numeric. Not sure if amazon ML can convert it for you automatically. If not just multiply hour by 60 and add minutes to it. Another interesting thing to do is to bucketize your time, by selecting which half hour or wider interval. You do it by dividing (h*60+m) by some number depending how many buckets you want. So to try 120 to get 2 hr intervals. Generally the more data you have the smaller intervals you can have. The key is to have a lot of samples in each bucket.
2) You should really think about removing restaurantID from your input data. Having it there will cause the model to over-fit on it. So it will not be able to make predictions about restaurant with id:5 based on the facts it learn from restaurants with id:3 or id:9. Having restaurant id there might be okay if you have a lot of data about each restaurant and you don't care about extrapolating your predictions to the restaurants that are not in the training set.
3) You never send restaurantID to predict data about it. The way it usually works you need to pick what are you trying to predict. In your case probably 'waitingTime' is most useful attribute. So you need to send weekDay, time and number of people and the model will output waiting time.
You should think what is relevant for the prediction to be accurate, and you should use your domain expertise to define the features/attributes you need to have in your data.
For example, time of the day, is not just a number. From my limited understanding in restaurant, I would drop the minutes, and only focus on the hours.
I would certainly create a model for each restaurant, as the popularity of the restaurant or the type of food it is serving is having an impact on the wait time. With Amazon ML it is easy to create many models as you can build the model using the SDK, and even schedule retraining of the models using AWS Lambda (that mean automatically).
I'm not sure what the feature called tablePeople means, but a general recommendation is to have as many as possible relevant features, to get better prediction. For example, month or season is probably important as well.
In contrast with some answers to this post, I think resturantID helps and it actually gives valuable information. If you have a significant amount of data per each restaurant then you can train a model per each restaurant and get a good accuracy, but if you don't have enough data then resturantID is very informative.
1) Just imagine what if you had only two columns in your dataset: restaurantID and waitingTime. Then wouldn't you think the restaurantID from the testing data helps you to find a rough waiting time? In the simplest implementation, your waiting time per each restaurantID would be the average of waitingTime. So definitely restaurantID is a valuable information. Now that you have more features in your dataset, you need to check if restaurantID is as effective as the other features or not.
2) If you decide to keep restaurantID then you must use it as a categorical string. It should be a non-parametric feature in your dataset and maybe that's why you did not get a proper result.
On the issue with day and time I agree with other answers and considering that you are building your model for the restaurant, hourly time may give a more accurate result.

WSO2 CEP window time 1 day

I am evaluating different possible solution for ATM Card transaction fraud detection with input load of around 50000 per second and response time few seconds.
WSO2 CEP looks better fit for overall solution, but stuck with problem of memory & performance as I am new to WSO2 CEP so please suggest if there is better way of doing below in CEP WSO2/CEP.
in order to detect fraud we have to capture data aggregation over a time period of 1 day which is causing either memory overflow or performance hit.
1) Below causing out of memory as CEP tries to keep all record in memory for whole day
from instream#window.time(1 day)
select sum(amount) as totalAmt;
2) Below causing performance hit as it tries to load all records from table for doing some.
define table instream_table (....) from ( datasource,table,cache policy) ;
from instream#window.length(1) join instream_table
on instream.card_id==intable.card_id
select sum(instream_table.amount) as totalAmt;
worst thing I have noticed that CEP fires select * from instream_table instead of adding even where clause for card_id, Where I was expecting CEP to intelligent enough to fire select sum(amount) from instream_table where card_id=xxxxx
i have looked at documentation for window in WSO2 CEP but could not find any way to optimize this as it looks like WSO2 CEP tries to everything in memory.
let me know if there is any work around or better solution to achieve this. I have looked at other CEP engines like esper but seems every body is doing this in same manner.
It seems a known bug in event tables.. I have created a jira to track that issue in [1] and added a patch for that issue as well.. Will fix that issue in next release..
[1] https://wso2.org/jira/browse/CEP-866
Thanks..

SimpleDB Incremental Index

I understand SimpleDB doesn't have an auto increment but I am working on a script where I need to query the database by sending the id of the last record I've already pulled and pull all subsequent records. In a normal SQL fashion if there were 6200 records I already have 6100 of them when I run the script I query records with an ID greater than > 6100. Looking at the response object, I don't see anything I can use. It just seems like there should be a sequential index there. The other option I was thinking would be a real time stamp. Any ideas are much appreciated.
Using a timestamp was perfect for what I needed to do. I followed this article to help me on my way:http://aws.amazon.com/articles/1232 I would still welcome if anyone knows if there is a way to get an incremental index number.