custom mapping for mapper attachment type with elasticsearch-persistence ruby - elasticsearch-rails

In my project I store data in an ActiveRecord model and index HTML documents in Elasticsearch using the mapper-attachments plugin. My document mapping looks like this:
include Elasticsearch::Model
settings index: { number_of_shards: 5 } do
  mappings do
    indexes :alerted
    indexes :title, analyzer: 'english', index_options: 'offsets'
    indexes :summary, analyzer: 'english', index_options: 'offsets'
    indexes :content, type: 'attachment', fields: {
      author: { index: "no" },
      date: { index: "no" },
      content: {
        store: "yes",
        type: "string",
        term_vector: "with_positions_offsets"
      }
    }
  end
end
I ran a query to double-check my doc mapping, and the result is:
"mappings": {
"feed_entry": {
"properties": {
"content": {
"type": "attachment",
"path": "full",
"fields": {
"content": {
"type": "string",
"store": true,
"term_vector": "with_positions_offsets"
},
It works great (note the type: 'attachment' above); I can search through the HTML documents perfectly.
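For reference, the kind of search I run against it looks roughly like this (a sketch: the index name feed_entries is made up, and I'm assuming the extracted text is addressed as content.content, the sub-field naming used in the mapper-attachments examples; highlighting works because that sub-field is stored with with_positions_offsets):
GET /feed_entries/feed_entry/_search
{
  "query": {
    "match": { "content.content": "some phrase from the html" }
  },
  "highlight": {
    "fields": { "content.content": {} }
  }
}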
I have a performance problem with ActiveRecord (backed by MySQL), and since I don't really need to store this data in the database, I decided to migrate to storing it in Elasticsearch.
I am experimenting with the elasticsearch-persistence gem. I configured the mapping as below:
include Elasticsearch::Persistence::Model
attribute :alert_id, Integer
attribute :title, String, mapping: { analyzer: 'english' }
attribute :url, String, mapping: { analyzer: 'english' }
attribute :summary, String, mapping: { analyzer: 'english' }
attribute :alerted, Boolean, default: false, mapping: { analyzer: 'english' }
attribute :fingerprint, String, mapping: { analyzer: 'english' }
attribute :feed_id, Integer
attribute :keywords
attribute :content, nil, mapping: {
  type: 'attachment',
  fields: {
    author: { index: "no" },
    date: { index: "no" },
    content: {
      store: "yes",
      type: "string",
      term_vector: "with_positions_offsets"
    }
  }
}
But when I query the mapping, I get something like this:
"mappings": {
"entry": {
"properties": {
"content": {
"properties": {
"_content": {
"type": "string"
},
"_content_type": {
"type": "string"
},
"_detect_language": {
"type": "boolean"
},
which is wrong. Can anyone tell me how to define a mapping with the attachment type?
I'd really appreciate your help.

In the meantime, I have hard-coded it this way:
def self.recreate_index!
  mappings = {}
  mappings[FeedEntry::ELASTIC_TYPE_NAME] = {
    "properties": {
      "alerted": {
        "type": "boolean"
      },
      "title": {
        # for exact match
        "index": "not_analyzed",
        "type": "string"
      },
      "url": {
        "index": "not_analyzed",
        "type": "string"
      },
      "summary": {
        "analyzer": "english",
        "index_options": "offsets",
        "type": "string"
      },
      "content": {
        "type": "attachment",
        "fields": {
          "author": {
            "index": "no"
          },
          "date": {
            "index": "no"
          },
          "content": {
            "store": "yes",
            "type": "string",
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
  options = {
    index: FeedEntry::ELASTIC_INDEX_NAME,
  }
  self.gateway.client.indices.delete(options) rescue nil
  self.gateway.client.indices.create(options.merge(body: { mappings: mappings }))
end
And then I override the to_hash method:
def to_hash(options={})
  hash = self.as_json
  map_attachment(hash) if !self.alerted
  hash
end

# encode the content to Base64 format
def map_attachment(hash)
  hash["content"] = {
    "_detect_language": false,
    "_language": "en",
    "_indexed_chars": -1,
    "_content_type": "text/html",
    "_content": Base64.encode64(self.content)
  }
  hash
end
Then I have to call
FeedEntry.recreate_index!
beforehand to create the mapping in Elasticsearch. Be careful: when you update the document, you might end up with double Base64 encoding of the content field. In my scenario, I check the alerted field to guard against that.
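For illustration, the hash that ends up being indexed after map_attachment runs looks roughly like this (a hand-written sketch with made-up values and most attributes omitted; _content is the Base64-encoded original HTML):
{
  "title": "Example feed entry",
  "alerted": false,
  "content": {
    "_detect_language": false,
    "_language": "en",
    "_indexed_chars": -1,
    "_content_type": "text/html",
    "_content": "<Base64 of the original HTML>"
  }
}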

Related

Amplify Push erroring out when updating CustomResources.json

I'm building a geospatial search on properties using AWS Amplify and ElasticSearch.
I'm currently following this guide: https://gerard-sans.medium.com/finding-the-nearest-locations-around-you-using-aws-amplify-part-2-ce4603605be6
I set up my model as follows:
type Property @model @searchable @auth(rules: [{allow: public}]) {
id: ID!
...
Loc: Coord!
}
type Coord {
lon: Float!
lat: Float!
}
I also added a custom Query:
type Query {
nearbyProperties(
location: LocationInput!,
m: Int,
limit: Int,
nextToken: String
): ModelPropertyConnection
}
input LocationInput {
lat: Float!
lon: Float!
}
type ModelPropertyConnection {
items: [Property]
total: Int
nextToken: String
}
I added resolvers for request and response:
## Query.nearbyProperties.req.vtl
## Objects of type Property will be stored in the /property index
#set( $indexPath = "/property/doc/_search" )
#set( $distance = $util.defaultIfNull($ctx.args.m, 500) )
#set( $limit = $util.defaultIfNull($ctx.args.limit, 10) )
{
"version": "2017-02-28",
"operation": "GET",
"path": "$indexPath.toLowerCase()",
"params": {
"body": {
"from" : 0,
"size" : ${limit},
"query": {
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "${distance}m",
"Loc" : $util.toJson($ctx.args.location)
}
}
}
},
"sort": [{
"_geo_distance": {
"Loc": $util.toJson($ctx.args.location),
"order": "asc",
"unit": "m",
"distance_type": "arc"
}
}]
}
}
}
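For what it's worth, with the defaults (m and limit omitted, so distance becomes 500 and size becomes 10) the request template above should render to roughly the following body; the coordinates are placeholder values:
{
  "version": "2017-02-28",
  "operation": "GET",
  "path": "/property/doc/_search",
  "params": {
    "body": {
      "from": 0,
      "size": 10,
      "query": {
        "bool": {
          "must": { "match_all": {} },
          "filter": {
            "geo_distance": {
              "distance": "500m",
              "Loc": { "lat": 51.5, "lon": -0.12 }
            }
          }
        }
      },
      "sort": [{
        "_geo_distance": {
          "Loc": { "lat": 51.5, "lon": -0.12 },
          "order": "asc",
          "unit": "m",
          "distance_type": "arc"
        }
      }]
    }
  }
}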
and response:
## Query.nearbyProperties.res.vtl
#set( $items = [] )
#foreach( $entry in $context.result.hits.hits )
#if( !$foreach.hasNext )
#set( $nextToken = "$entry.sort.get(0)" )
#end
$util.qr($items.add($entry.get("_source")))
#end
$util.toJson({
"items": $items,
"total": $ctx.result.hits.total,
"nextToken": $nextToken
})
And now the CustomStacks.json:
{
"AWSTemplateFormatVersion": "2010-09-09",
"Description": "An auto-generated nested stack.",
"Metadata": {},
"Parameters": {
"AppSyncApiId": {
"Type": "String",
"Description": "The id of the AppSync API associated with this project."
},
"AppSyncApiName": {
"Type": "String",
"Description": "The name of the AppSync API",
"Default": "AppSyncSimpleTransform"
},
"env": {
"Type": "String",
"Description": "The environment name. e.g. Dev, Test, or Production",
"Default": "NONE"
},
"S3DeploymentBucket": {
"Type": "String",
"Description": "The S3 bucket containing all deployment assets for the project."
},
"S3DeploymentRootKey": {
"Type": "String",
"Description": "An S3 key relative to the S3DeploymentBucket that points to the root\nof the deployment directory."
}
},
"Resources": {
"QueryNearbyProperties": {
"Type": "AWS::AppSync::Resolver",
"Properties": {
"ApiId": { "Ref": "AppSyncApiId" },
"DataSourceName": "ElasticSearchDomain",
"TypeName": "Query",
"FieldName": "nearbyProperties",
"RequestMappingTemplateS3Location": {
"Fn::Sub": [
"s3://${S3DeploymentBucket}/${S3DeploymentRootKey}/resolvers/Query.nearbyProperties.req.vtl", {
"S3DeploymentBucket": { "Ref": "S3DeploymentBucket" },
"S3DeploymentRootKey": { "Ref": "S3DeploymentRootKey" }
}]
},
"ResponseMappingTemplateS3Location": {
"Fn::Sub": [ "s3://${S3DeploymentBucket}/${S3DeploymentRootKey}/resolvers/Query.nearbyProperties.res.vtl", {
"S3DeploymentBucket": { "Ref": "S3DeploymentBucket" },
"S3DeploymentRootKey": { "Ref": "S3DeploymentRootKey" }
}]
}
}
}
},
"Conditions": {
"HasEnvironmentParameter": {
"Fn::Not": [
{
"Fn::Equals": [
{
"Ref": "env"
},
"NONE"
]
}
]
},
"AlwaysFalse": {
"Fn::Equals": ["true", "false"]
}
},
"Outputs": {
"EmptyOutput": {
"Description": "An empty output. You may delete this if you have at least one resource above.",
"Value": ""
}
}
}
But when I try to run amplify push, it does not work. It fails with something like: "Resource is not in the state stackUpdateComplete".
Any help?
You could take a look at the resource in CloudFormation. Your stack is probably stuck in an update. Go to CloudFormation, select your stack (or uncheck "View nested" first) and go to the Events tab. There you will probably find the reason why the stack can't update.
If it's stuck, cancel the update from the stack actions.

JSON Schema if/then/else for property that can be an object or null based on another properties value

I have a property that will be an object or null based on the value of another property. I'm trying to add this new check to my schema using if/then/else. This is for AJV validation in Postman, if that's pertinent.
For example, here is a sample payload:
{
"topObj": {
"subItem1": "2021-09-12",
"subItem2": "2021-09-21",
"ineligibleReason": "",
"myObject": {
"subObject1": true,
"subObject2": ""
}
}
}
If ineligibleReason is an empty string, then myObject should be an object. If ineligibleReason isn't an empty string, then myObject should be null, as in this table:
ineligibleReason value | myObject value | schema valid?
-----------------------|----------------|--------------
""                     | object         | true
""                     | null           | false
"any value"            | null           | true
"any value"            | object         | false
Here's what I have so far. jsonschema.dev thinks it's a valid schema, but when I add a value to ineligibleReason in the payload (keeping myObject as an object), it still says the JSON payload is valid!
{
"type": "object",
"properties": {
"topObj": {
"type": "object",
"properties": {
"subItem1": { "type": "string" },
"subItem2": { "type": "string" },
"ineligibleReason": { "type": "string" },
"myObject": { "type": ["object", "null"] }
}
}
},
"required": ["topObj"],
"additionalProperties": false,
"if": {
"properties": { "topObj.ineligibleReason": { "const": "" } }
},
"then": {
"properties": {
"topObj.myObject": {
"type": "object",
"properties": {
"subObject1": { "type": "boolean" },
"subObject2": { "type": "string" }
},
"required": ["subObject1", "subObject2"],
"additionalProperties": false
}
}
},
"else" : {
"properties": { "myObject": { "type": "null" } }
}
}
I have this in jsonschema.dev but it gives Schema Error "/properties/required" should be object,boolean.
My basic schemas are working, but I'm not sure how to add this conditional validation based on another property's value.
Update 1: I updated the schema and it now parses as valid. However, the payload still validates when ineligibleReason has a value and myObject is an object instead of null.
Update 2: I updated the schema again, moving the if/then/else to the bottom (no longer "inline"). The schema definition parses as valid; however, the payload still validates successfully in the invalid situations (ineligibleReason has a value and myObject is an object instead of null).
How do I get the if/then/else to validate my myObject property correctly?
The error message is pointing to the problem. /properties/required is declaring a property named "required", and then the value under that needs to be a schema (object, or boolean). So you need to lift that "required" to be adjacent to "properties", rather than beneath it.
Answer per Ryan Miller on the json-schema Slack. A slightly different tack than I was trying, but it's simpler and it works!
{
"type": "object",
"properties": {
"topObj": {
"type": "object",
"properties": {
"subItem1": { "type": "string" },
"subItem2": { "type": "string" },
"ineligibleReason": { "type": "string" },
"myObject": {
"type": ["object", "null"],
"$comment": "'properties', 'required', and 'additionalProperties' only make assertions when the instance is an object.",
"properties": {
"subObject1": { "type": "boolean" },
"subObject2": { "type": "string" }
},
"required": ["subObject1", "subObject2"],
"additionalProperties": false
}
},
"required": ["subItem1", "subItem2", "ineligibleReason", "myObject"],
"additionalProperties": false,
"if": {
"$comment": "Is an ineligibleReason defined?",
"properties": {
"ineligibleReason": {"minLength": 1}
}
},
"then": {
"$comment": "Then 'myObject' must be null.",
"properties": {
"myObject": {"type": "null"}
}
}
}
},
"required": ["topObj"],
"additionalProperties": false
}
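As a quick sanity check against the table above (the ineligibleReason text here is just a made-up example), the corrected schema should now reject a payload where ineligibleReason is non-empty but myObject is still an object:
{
  "topObj": {
    "subItem1": "2021-09-12",
    "subItem2": "2021-09-21",
    "ineligibleReason": "some reason",
    "myObject": {
      "subObject1": true,
      "subObject2": ""
    }
  }
}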

ElasticSearch reindexing with selected fields results in the addition of a non-selected empty field

Scenario:
We are using AWS ElasticSearch 6.8. We have an index (index-A) with a mapping structure consisting of multiple nested objects and a JSON hierarchy. We need to create a new index (index-B) and move all documents from index-A to index-B.
We need to create index-B with only specific fields.
We need to rename field names while reindexing
e.g.
index-A mapping:
{
"userdata": {
"properties": {
"payload": {
"type": "object",
"properties": {
"Alldata": {
"Username": {
"type": "keyword"
},
"Designation": {
"type": "keyword"
},
"Company": {
"type": "keyword"
},
"Region": {
"type": "keyword"
}
}
}
}
}
}}
Expected structure of the index-B mapping after reindexing with the renames (Company → cnm, Region → rg):
{
"userdata": {
"properties": {
"cnm": {
"type": "keyword"
},
"rg": {
"type": "keyword"
}
}
}}
Steps we are following:
First, we use the Create Index API to create index-B with the above mapping structure.
Once the index is created, we create an ingest pipeline:
PUT ElasticSearch domain endpoint/_ingest/pipeline/my_rename_pipeline
{
"description": "rename field pipeline",
"processors": [{
"rename": {
"field": "payload.Company",
"target_field": "cnm",
"ignore_missing": true
}
},
{
"rename": {
"field": "payload.Region",
"target_field": "rg",
"ignore_missing": true
}
}
]
}
Then we perform the reindex operation; the payload for it is below:
let reindexParams = {
wait_for_completion: false,
slices: "auto",
body: {
"conflicts": "proceed",
"source": {
"size": 8000,
"index": "index-A",
"_source": ["payload.Company", "payload.Region"]
},
"dest": {
"index": "index-B",
"pipeline": "my_rename_pipeline",
"version_type": "external"
}
}
};
Problem:
Once the reindexing is complete, all documents are transferred to the new index with the renamed fields as expected, but there is one additional field which was not selected. As you can see below, the "payload" object with its metadata is also added to the new index after reindexing. This field is empty and contains no data.
index-B looks like this after reindexing:
{
"userdata": {
"properties": {
"cnm": {
"type": "keyword"
},
"rg": {
"type": "keyword"
},
"payload": {
"properties": {
"Alldata": {
"type": "object"
}
}
}
}
}}
We are unable to find a workaround and need help with how to stop this field from being created. Any help will be appreciated.
Great job!! You're almost there, you simply need to remove the payload field within your pipeline using the remove processor and you're good:
{
"description": "rename field pipeline",
"processors": [
{
"rename": {
"field": "payload.Company",
"target_field": "cnm",
"ignore_missing": true
}
},
{
"rename": {
"field": "payload.Region",
"target_field": "rg",
"ignore_missing": true
}
},
{
"remove": { <--- add this processor
"field": "payload"
}
}
]
}
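If it helps, you can also dry-run the pipeline before reindexing with the Simulate Pipeline API, feeding it a document containing the fields the rename processors look for (the values below are made up); the response should show cnm and rg and no payload object:
POST _ingest/pipeline/my_rename_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "payload": {
          "Company": "Acme",
          "Region": "EMEA"
        }
      }
    }
  ]
}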

ElasticSearch - copy_to with dynamic template

Following up on my previous question: ElasticSearch overriding mapping from text to object
I have an index template:
{
"template" : "project.*",
"order" : 100,
"dynamic_templates": [
{
"message_field": {
"mapping": {
"type": "object"
},
"match": "message"
},
"message_properties": {
"path_match": "message.*",
"mapping": {
"type": "string",
"index": "not_analyzed"
}
}
}
]
}
which basically creates new fields for everything under the "message" field. I am doing this because the "message" field is mapped as a string in another index template and I am overriding it.
Sample document:
{
"level": "30",
...
"kubernetes": {
"container_name": "data-sync-server",
"namespace_name": "alitest03",
...
},
"message": {
"tag": "AUDIT",
"requestId": 1234,
...
},
}
...
}
This works fine, but it ends up creating top level fields like "tag" and "requestId".
I don't want to pollute the top level and would like to have fields like "audit.tag", "audit.requestId".
I tried using copy_to like this, but I don't see any "audit.*" fields:
{
"template" : "project.*",
"order" : 100,
"dynamic_templates": [
{
"message_field": {
"mapping": {
"type": "object"
},
"match": "message"
},
"message_properties": {
"path_match": "message.*",
"mapping": {
"type": "string",
"index": "not_analyzed",
"copy_to" : "audit.{name}"
}
}
}
]
}
A sample search result when using the template above with copy_to is below. I don't see any "audit.*" fields.
{
"timestamp": "October 15th 2018, 15:46:15.994",
"_id": "YmI1NDRjMTgtZTY3Ni00ZGUxLTk2NDMtOTJhZjk3ZWU1YTJj",
"_index": "project.alitestproj02.aa564e69-c643-11e8-af2a-fa163e4c9c9e.2018.10.15",
"_score": "",
"_type": "com.redhat.viaq.common",
...
"kubernetes.container_name": "data-sync-server",
"kubernetes.namespace_name": "alitestproj02",
...
"message": "{\"level\":30,\"time\":1539607575994,\"pid\":19,\"hostname\":\"data-sync-server-6-pxcsm\",\"tag\":\"AUDIT\",\"msg\":\"\",\"requestId\":20355,\"operationType\":\"query\",\"parentTypeName\":\"Meme\",\"path\":\"allMemes.866.owner\",\"success\":true,\"parent\":{\"_type\":\"meme\",\"photourl\":\"photo472\",\"owner\":\"owner35\",\"likes\":0,\"_id\":\"zzEnLAQmQeuTC1mj\",\"createdAt\":\"2018-10-15T11:58:33.896Z\",\"updatedAt\":\"2018-10-15T11:58:33.896Z\",\"id\":\"zzEnLAQmQeuTC1mj\"},\"arguments\":{},\"dataSourceType\":\"InMemory\",\"v\":1}\n",
"requestId": "20355",
"tag": "AUDIT",
...
"v": 1
}

Logstash keeps creating fields despite dynamic mapping being deactivated

I have defined my own template to be used by Logstash, where I have deactivated dynamic mapping:
{
"my_index": {
"order": 0,
"template": "my_index",
"settings": {
"index": {
"mapper": {
"dynamic": "false"
},
"analysis": {
"analyzer": {
"nlp_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "nlp_tokenizer"
}
},
"tokenizer": {
"nlp_tokenizer": {
"pattern": ""
"(\w+)|(\s*[\s+])"
"",
"type": "pattern"
}
}
},
"number_of_shards": "1",
"number_of_replicas": "0"
}
},
"mappings": {
"author": {
"properties": {
"author_name": {
"type": "keyword"
},
"author_pseudo": {
"type": "keyword"
},
"author_location": {
"type": "text",
"fields": {
"standard": {
"analyzer": "standard",
"term_vector": "yes",
"type": "text"
},
"nlp": {
"analyzer": "nlp_analyzer",
"term_vector": "yes",
"type": "text"
}
}
}
}
}
}
}
}
To test that Elasticsearch won't generate new fields, I include a field in my events that is not present in my mapping. Let's say that I have this event:
{
"type" => "author",
"author_pseudo" => "chloemdelorenzo",
"author_name" => "Chloe DeLorenzo",
"author_location" => "US",
}
Elasticsearch will generate a new field in the mapping when indexing this event:
"type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
I know that Logstash is using my template, because my mapping uses a custom analyzer and I can find it in the generated mapping. But apparently it doesn't take into account that dynamic mapping is disabled.
I want Elasticsearch to ignore fields that are not present in my mapping, but still index the fields that have a defined mapping. How can I stop Logstash from creating new fields?
You should enforce the mapping at the document type level.
https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html
Regardless of the value of this setting, types can still be added
explicitly when creating an index or with the PUT mapping API.
So your mapping will look like:
"mappings": {
"author": {
"dynamic": false,
"properties": {
"author_name": {
"type": "keyword"
},
"author_pseudo": {
"type": "keyword"
},
"author_location": {
"type": "text",
"fields": {
"standard": {
"analyzer": "standard",
"term_vector": "yes",
"type": "text"
},
"nlp": {
"analyzer": "nlp_analyzer",
"term_vector": "yes",
"type": "text"
}
}
}
}
}
}
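In other words, relative to the template you posted, the "dynamic": false flag moves out of settings.index.mapper and onto the author type itself; abbreviated, the template would be shaped roughly like this (the "..." parts stand for the settings and properties you already have):
{
  "my_index": {
    "order": 0,
    "template": "my_index",
    "settings": { ... },
    "mappings": {
      "author": {
        "dynamic": false,
        "properties": { ... }
      }
    }
  }
}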
This answer is not exactly what you are requesting, but you can manually remove fields with a Logstash filter like this:
filter {
  mutate {
    remove_field => ["fieldname"]
  }
}
If your events have a defined list of fields, you could solve your problem this way.