OpenSearch Anomaly Detector - Cannot preview historical data

I have roughly 1,400 log documents in my OpenSearch index. Roughly 835 are historical, and I am now continuously ingesting new ones with Kinesis Firehose (screenshot: rows in my index).
When I create a detector I am not able to preview this data; I get the following alert:
"No sample anomaly result generated. Please check the detector interval and make sure you have >400 data points during the preview date range." (error screenshot)
My detector interval is 3 minutes, the window delay is 1 minute, and the window size is 8.
My data is in the following format (screenshot: log data). Historical data can also be found here: Dataset
I did not get an anomaly here (screenshot: feature breakdown).
I also have a template for my index:
PUT _template/logs
{
  "index_patterns": ["logs*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "epoch_second"
      }
    }
  }
}
This is the first time I am working with anomaly detectors, so I am not sure whether this is expected or whether I am missing something.
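For context on that alert: if each detector interval produces one aggregated data point, then 400 data points at a 3-minute interval cover 400 × 3 = 1,200 minutes (about 20 hours), so the preview range has to span at least that much continuously populated data. Below is a minimal sketch to check how many 3-minute buckets actually contain documents, assuming basic-auth access to the domain and the logs* pattern from the template above (the endpoint and credentials are placeholders):

import requests

# Count how many 3-minute buckets in the index actually contain data, to see
# whether the preview range can yield >400 data points.
# The endpoint and credentials are placeholders for your OpenSearch domain.
ENDPOINT = "https://my-opensearch-domain.example.com"

query = {
    "size": 0,
    "aggs": {
        "per_interval": {
            "date_histogram": {
                "field": "timestamp",
                "fixed_interval": "3m"
            }
        }
    }
}

resp = requests.post(f"{ENDPOINT}/logs*/_search", json=query, auth=("user", "password"))
buckets = resp.json()["aggregations"]["per_interval"]["buckets"]
non_empty = [b for b in buckets if b["doc_count"] > 0]
print(f"{len(non_empty)} of {len(buckets)} 3-minute buckets contain data")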

Related

AWS Elasticsearch performance issue

I have an index that is search heavy; requests per minute vary from 15-20k. The issue is that for the first few days the response time of the search query is around 15 ms, but it then starts increasing gradually and reaches ~70 ms. Some requests start queuing (as per the Search thread pool graph in the AWS console), but there are no rejections. Queuing increases the latency of the search requests.
I learned that queuing happens when there is pressure on resources, but I think I have sufficient CPU and memory; please look at the config below.
I enabled slow query logs but did not find anything anomalous. Even though the average response time is around 16 ms, I see a few queries going above 50 ms, but there is no issue with the search query itself. There are around 8k searchable documents.
I need suggestions on how to improve performance here. The document mapping, search query, and ES config are given below. Is there any issue with the mapping or the query?
Mapping:
{
  "data": {
    "mappings": {
      "_doc": {
        "properties": {
          "a": {
            "type": "keyword"
          },
          "b": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Search query:
{
  "size": 5000,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "a": [
              "all",
              "abc"
            ],
            "boost": 1
          }
        },
        {
          "terms": {
            "b": [
              "all",
              123
            ],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "stored_fields": []
}
I am using keyword in the mapping and terms in the search query because I want to search for exact values. boost and adjust_pure_negative are added automatically; from what I read, they should not affect performance.
Index settings:
{
  "data": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "data",
        "creation_date": "12345678154072",
        "number_of_replicas": "7",
        "uuid": "3asd233Q9KkE-2ndu344",
        "version": {
          "created": "10499"
        }
      }
    }
  }
}
ES config:
Master node instance type: m5.large.search
Master nodes: 3
Data node instance type: m5.2xlarge.search
Data nodes: 8 (8 vcpu, 32 GB memory)
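Since the queuing is only visible in the AWS console graph, it can also help to read the search thread pool stats directly from the cluster; a minimal sketch using the _cat/thread_pool API (the endpoint and credentials are placeholders):

import requests

# Show active, queued, and rejected search requests per data node via the
# _cat/thread_pool API. The endpoint and credentials are placeholders.
ENDPOINT = "https://my-es-domain.example.com"

resp = requests.get(
    f"{ENDPOINT}/_cat/thread_pool/search?v&h=node_name,active,queue,rejected",
    auth=("user", "password"),
)
print(resp.text)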

Timeout with CouchDB mapReduce when the database is huge

Details:
Apache CouchDB v. 3.1.1
About 5 GB of Twitter data has been dumped in partitions.
The map-reduce function that I have written:
{
  "_id": "_design/Info",
  "_rev": "13-c943aaf3b77b970f4e787be600dd240e",
  "views": {
    "trial-view": {
      "map": "function (doc) {\n emit(doc.account_name, 1);\n}",
      "reduce": "_count"
    }
  },
  "language": "javascript",
  "options": {
    "partitioned": true
  }
}
When I try the following request in Postman:
http://<server_ip>:5984/mydb/_partition/partition1/_design/Info/_view/trial-view?key="BT"&group=true
I get the following error:
{
  "error": "timeout",
  "reason": "The request could not be processed in a reasonable amount of time."
}
How can I apply mapReduce to such a huge amount of data?
So I thought of answering my own question after realizing my mistake. The answer is simple: it just needed more time, because building the view index takes a long time. You can look at the database metadata to see the data being indexed.
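One way to watch that indexing progress is to poll the server's /_active_tasks endpoint, which lists running view indexer jobs with a progress percentage; a minimal sketch, assuming admin access (the credentials are placeholders and <server_ip> is the same placeholder as in the request above):

import time
import requests

# Poll CouchDB's /_active_tasks endpoint and print the progress of any view
# indexer jobs. Host and credentials are placeholders.
COUCH = "http://admin:password@<server_ip>:5984"

while True:
    tasks = requests.get(f"{COUCH}/_active_tasks").json()
    indexers = [t for t in tasks if t.get("type") == "indexer"]
    if not indexers:
        print("No indexer tasks running - the view should be ready to query.")
        break
    for t in indexers:
        print(f"{t.get('design_document')} on {t.get('database')}: {t.get('progress')}% indexed")
    time.sleep(10)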

Store invalid JSON columns as STRING or skip them in BigQuery

I have a JSON data file which looks something like below
{
  "key_a": "value_a",
  "key_b": "value_b",
  "key_c": {
    "c_nested/invalid.key.according.to.bigquery": "valid_value_though"
  }
}
As we know, BigQuery considers c_nested/invalid.key.according.to.bigquery an invalid column name. I have a huge amount of log data exported by Stackdriver into Google Cloud Storage with a lot of invalid field names (according to BigQuery, fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long).
As a workaround, I am trying to store the value of key_c (the whole {"c_nested/invalid.key.according.to.bigquery": "valid_value_though"} object) as a string in the BigQuery table.
I presume my table definition would look something like below:
[
  {
    "mode": "NULLABLE",
    "name": "key_a",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "key_b",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "key_c",
    "type": "STRING"
  }
]
When I try to create a table with this schema, I get the error below:
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 0: Expected key
Assuming it is now supported in BigQuery, I thought of simply skipping the key_c column with the below schema:
[
  {
    "mode": "NULLABLE",
    "name": "key_a",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "key_b",
    "type": "STRING"
  }
]
The above schema at least lets me create a permanent table (for querying external data), but when I try to query the data I get the following error:
Error while reading table:
projectname.dataset_name.table_name, error message:
JSON parsing error in row starting at position 0: No such field: key_c.
I understand there is a way described here to load each JSON row raw into BigQuery - as if it were a CSV - and then parse it in BigQuery, but that makes the queries too complicated.
Is cleaning the data the only way? How can I tackle this?
I am looking for a way to skip making a column for invalid fields and store them directly as STRING, or simply to ignore them entirely. Is this possible?
One of the main premises of using BQ (and other cloud databases) is that storage is cheap. In practice, it is often helpful to load 'raw' or 'source' data into BQ and then transform it as needed (with views or other transformation tools). This is a paradigm shift from ETL to ELT.
With that in mind, I would import your "invalid" JSON blob as a string, and then parse it in your transformation steps. Here is one method:
with data as (
  select '{"key_a":"value_a","key_b":"value_b","key_c":{"c_nested/invalid.key.according.to.bigquery":"valid_value_though"}}' as my_string
)
select
  JSON_EXTRACT_SCALAR(my_string, '$.key_a') as key_a,
  JSON_EXTRACT_SCALAR(my_string, '$.key_b') as key_b,
  JSON_EXTRACT_SCALAR(REPLACE(my_string, "c_nested/invalid.key.according.to.bigquery", "custom_key"), '$.key_c.custom_key') as key_c
from data
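To get each raw JSON line into that my_string column in the first place, one option is the CSV trick the question mentions: load the export as a one-column CSV with a delimiter that cannot appear in the data. Here is a sketch with the google-cloud-bigquery Python client, assuming the export is newline-delimited (one JSON object per line); the bucket path, dataset, and table names are placeholders:

from google.cloud import bigquery

# Load each raw JSON line untouched into a single STRING column ("my_string")
# by treating the file as a one-column CSV. A literal tab cannot appear
# unescaped inside valid single-line JSON, so it is a safe delimiter.
# Bucket, dataset, and table names below are placeholders.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\t",
    quote_character="",  # don't treat quotes inside the JSON as CSV quoting
    schema=[bigquery.SchemaField("my_string", "STRING")],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/stackdriver-export/*.json",
    "my-project.my_dataset.raw_logs",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

Once the raw table exists, the query above can select my_string from it instead of the inline CTE.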

AWS - How to obtain Billing Monthly Forecast programmatically

I'm just wondering if it is currently possible to obtain the billing monthly forecast amount using either an SDK or the API.
Looking at the AWS docs it doesn't seem possible. Although I haven't delved into the Cost Explorer API too much, I was wondering if anyone else has been able to obtain this data point?
There is a GetCostAndUsage method in the AWS Billing and Cost Management API which returns cost and usage metrics. The method also accepts a TimePeriod, which restricts the results to the given time frame. Although I have not tested it, you can try passing future dates; maybe it will return forecast results. Give it a try:
{
  "TimePeriod": {
    "Start": "2018-06-01",
    "End": "2018-06-30"
  },
  "Granularity": "MONTHLY",
  "Filter": {
    "Dimensions": {
      "Key": "SERVICE",
      "Values": [
        "Amazon Simple Storage Service"
      ]
    }
  },
  "GroupBy": [
    {
      "Type": "DIMENSION",
      "Key": "SERVICE"
    },
    {
      "Type": "TAG",
      "Key": "Environment"
    }
  ],
  "Metrics": ["BlendedCost", "UnblendedCost", "UsageQuantity"]
}
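For reference, the same request can be made from code with boto3's Cost Explorer client, using the parameters shown above (the dates and tag key are just the examples from the snippet):

import boto3

# Call Cost Explorer with the same parameters as the JSON above.
# The "ce" API is served from us-east-1.
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2018-06-01", "End": "2018-06-30"},
    Granularity="MONTHLY",
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Simple Storage Service"],
        }
    },
    GroupBy=[
        {"Type": "DIMENSION", "Key": "SERVICE"},
        {"Type": "TAG", "Key": "Environment"},
    ],
    Metrics=["BlendedCost", "UnblendedCost", "UsageQuantity"],
)

for period in response["ResultsByTime"]:
    print(period["TimePeriod"], period.get("Groups") or period.get("Total"))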

IBM Visual Recognition Classifier status failed

I have the following IBM Watson Visual Recognition Python SDK code for creating a simple classifier:
with open(os.path.dirname("/home/xxx/Desktop/Husky.zip/"), 'rb') as dogs, \
        open(os.path.dirname("/home/xxx/Desktop/Husky.zip/"), 'rb') as cats:
    print(json.dumps(visual_recognition.create_classifier('Dogs Vs Cats', dogs_positive_examples=dogs, negative_examples=cats), indent=2))
The response with the new classifier ID and its status is as follows:
{
  "status": "training",
  "name": "Dogs Vs Cats",
  "created": "2016-06-23T06:30:00.115Z",
  "classes": [
    {
      "class": "dogs"
    }
  ],
  "owner": "840ad7db-1e17-47bd-9961-fc43f35d2ad0",
  "classifier_id": "DogsVsCats_250748237"
}
The training status shows failed.
print(json.dumps(visual_recognition.list_classifiers(), indent=4))
{
  "classifiers": [
    {
      "status": "failed",
      "classifier_id": "DogsVsCats_250748237",
      "name": "Dogs Vs Cats"
    }
  ]
}
What is the cause of this?
with open(os.path.dirname("/home/xxx/Desktop/Husky.zip/"), 'rb') as dogs, \
        open(os.path.dirname("/home/xxx/Desktop/Husky.zip/"), 'rb') as cats:
    print(json.dumps(visual_recognition.create_classifier('Dogs Vs Cats', dogs_positive_examples=dogs, negative_examples=cats), indent=2))
You are sending the same file contents ("Husky.zip") for the service to use as both the positive and negative examples. However, the system requires at least 10 unique positive example images and 10 unique negative example images. The service compares the hash codes of the image file contents before training and keeps any duplicates only in the positive set, so your negative set is empty after de-duplication, which leads to the training failure. There should be an additional field called "explanation" in the verbose listing of your classifier details saying this may be the problem.
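In other words, the fix is to pass two different archives, each containing at least 10 unique images. Here is a sketch reusing the same call from the question; Cats.zip is a hypothetical archive of negative (cat) example images, and visual_recognition is the client object already configured in the question:

import json

# Use two *different* ZIP files: Husky.zip as the positive (dog) examples and
# a separate Cats.zip (hypothetical path) as the negative examples.
with open("/home/xxx/Desktop/Husky.zip", "rb") as dogs, \
        open("/home/xxx/Desktop/Cats.zip", "rb") as cats:
    print(json.dumps(
        visual_recognition.create_classifier(
            "Dogs Vs Cats",
            dogs_positive_examples=dogs,
            negative_examples=cats),
        indent=2))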
There are size limitations for training calls and data:
The service accepts a maximum of 10,000 images or 100 MB per .zip file.
The service requires a minimum of 10 images per .zip file.
The service accepts a maximum of 256 MB per training call.
There are also size limitations for classification calls:
The POST /v3/classify methods accept a maximum of 20 images per batch.
The POST /v3/detect_faces methods accept a maximum of 15 images per batch.
The POST /v3/recognize_text methods accept a maximum of 10 images per batch.
See http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/visual-recognition/customizing.shtml