How to match a string exactly OR exact substring from beginning using Regular Expression - regex

I'm trying to build a regex query for a database and it's got me stumped. If I have a string with a varying number of elements that has an ordered structure how can I find if it matches another string exactly OR some exact sub string when read from the left?
For example I have these strings
Canada.Ontario.Toronto.Downtown
Canada.Ontario
Canada.EasternCanada.Ontario.Toronto.Downtown
England.London
France.SouthFrance.Nice
They are structured by most general location to specific, left to right. However, the number of elements varies with some specifying a country.region.state and so on, and some just country.town. I need to match not only the words but the order.
So if I want to match "Canada.Ontario.Toronto.Downtown" I would want to both get #1 and #2 and nothing else. How would I do that? Basically running through the string and as soon as a different character comes up it's not a match but still allow a sub string that ends "early" to match like #2.
I've tried making groups and using "?" like (canada)?.?(Ontario)?.? etc but it doesn't seem to work in all situations since it can match nothing as well.
Edit as requested:
Mongodb Database Collection:
[
{
"_id": "doc1",
"context": "Canada.Ontario.Toronto.Downtown",
"useful_data": "Some Data"
},
{
"_id": "doc2",
"context": "Canada.Ontario",
"useful_data": "Some Data"
},
{
"_id": "doc3",
"context": "Canada.EasternCanada.Ontario.Toronto.Downtown",
"useful_data": "Some Data"
},
{
"_id": "doc4",
"context": "England.London",
"useful_data": "Some Data"
},
{
"_id": "doc5",
"context": "France.SouthFrance.Nice",
"useful_data": "Some Data"
},
{
"_id": "doc6",
"context": "",
"useful_data": "Some Data"
}
]
User provides "Canada", "Ontario", "Toronto", and "Downtown" values in that order and I need to use that to query doc1 and doc2 and no others. So I need a regex pattern to put in here: collection.find({"context": {$regex: <pattern here>}) If it's not possible I'll just have to restructure the data and use different methods of finding those docs.

At each dot, start an nested optional group for the next term, and add start and end anchors:
^Canada(\.Ontario(\.Toronto(\.Downtown)?)?)?$
See live demo.

Related

How can I find and replace values bracket by bracket?

I'm trying to find every "color" value and replace it with a specific string, but only the "color" value of every "name" that has Bismuthinite" in it.
[
{
"name": "Poor Gneiss Bismuthinite",
"blockName": "tfc:ore/poor_bismuthinite/gneiss",
"order": 789,
"color": 5015620,
"drawing": false
},
{
"name": "Slate Halite",
"blockName": "tfc:ore/halite/slate",
"order": 1046,
"color": 7153517,
"drawing": false
},
The information wthin the next brackets (block? im not sure what the terminology is, i'm very new to coding in general) should not be selected or altered in any way. Only the information that matches "name" includes Bismuthinite" .
I've tried using a multiline find and replace using the ToolBucket plugin for Notepad++, but either it won't accomplish what I want it to do, or I just don't know how.

Regexp starts with not working Elasticsearch 6.*

I got trouble with understanding regexp mechanizm in ElasticSearch. I have documents that represent property units:
{
"Unit" :
{
"DailyAvailablity" :
"UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
}
DailyAvailability field codes availability of property by days for the next two years from today. 'A' means available, 'U' unabailable, 'I' can check in, 'O' can check out. How can I write regexp filter to get all units that are available in particular dates?
I tried to find the 'A' substring with particular length and offset in DailyAvailability field. For example to find units that would be available for 7 days in 7 days from today:
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
}
]
}
}
}
This query returns for instance unit with DateAvailability that starts from "UUUUUUUUUUUUUUUUUUUIAA", but contains suitable sequences somehere inside the field. How can I anchor regexp for entire source string? ES docs say that lucene regex should be anchored by default.
P.S. I have tried '^.{7}a{7}.*$'. Returns empty set.
It looks like you are using text datatype to store Unit.DailyAvailability (which is also the default one for strings if you are using dynamic mapping). You should consider using keyword datatype instead.
Let me explain in a bit more detail.
Why does my regex match something in the middle of a text field?
What happens with text datatype is that the data gets analyzed for full-text search. It does some transformations like lowercasing and splitting into tokens.
Let's try to use the Analyze API against your input:
POST _analyze
{
"text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
The response is:
{
"tokens": [
{
"token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
"start_offset": 0,
"end_offset": 255,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 255,
"end_offset": 510,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 510,
"end_offset": 732,
"type": "<ALPHANUM>",
"position": 2
}
]
}
As you can see, Elasticsearch has split your input into three tokens and lowercased them. This looks unexpected, but if you think that it actually tries to facilitate search for words in human language, it makes sense - there are no such long words.
That's why now regexp query ".{7}a{7}.*" will match: there is a token that actually starts with a lot of a's, which is an expected behavior of regexp query.
...Elasticsearch will apply the regexp to the terms produced by the
tokenizer for that field, and not to the original text of the field.
How can I make regexp query consider the entire string?
It is very simple: do not apply analyzers. The type keyword stores the string you provide as is.
With a mapping like this:
PUT my_regexes
{
"mappings": {
"doc": {
"properties": {
"Unit": {
"properties": {
"DailyAvailablity": {
"type": "keyword"
}
}
}
}
}
}
}
You will be able to do a query like this that will match the document from the post:
POST my_regexes/doc/_search
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
}
]
}
}
}
Note that the query became case-sensitive because the field is not analyzed.
This regexp won't return any results anymore: ".{12}a{7}.*"
This will: ".{12}A{7}.*"
So what about anchoring?
The regexes are anchored:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
The reason why it looked like the anchoring was wrong was most likely because tokens got split in an analyzed text field.
Just in addition to brilliant and helpfull answer of Nikolay Vasiliev. In my case I was forced to go farther to make it work on NEST .net. I added attribute mapping to DailyAvailability:
[Keyword(Name = "DailyAvailability")]
public string DailyAvailability { get; set; }
The filter still didn't work and I got mapping:
"DailyAvailability":"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
My field contained about 732 symbols so it was ignored by index. I tried:
[Keyword(Name = "DailyAvailability", IgnoreAbove = 1024)]
public string DailyAvailability { get; set; }
It didn't make any difference on mapping. And only after adding manual mappings it started working properly:
var client = new ElasticClient(settings);
client.CreateIndex("vrp", c => c
.Mappings(ms => ms.Map<Unit>(m => m
.Properties(ps => ps
.Keyword(k => k.Name(u => u.DailyAvailability).IgnoreAbove(1024))
)
)
));
The point is that:
ignore_above - Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted. Please however note that default dynamic mapping rules create a sub keyword field that overrides this default by setting ignore_above: 256.
So use explicit mapping for long keyword fields to set ignore_above if you need to filter them with regexp.
For anyone could be useful, the ES tool does not support the \d \w modes, you should write those as [0-9] and [a-z]

Elasticsearch aggregation to extract pattern and occurrences

I have trouble formulating what I'm looking for so I'll use an example:
You put 3 documents in elasticsearch all with a field "name" containing these values: "test", "superTest51", "stvv".
Is it possible to extract a regular expression like pattern with the occurrences? In this case:
"xxxx": 2 occurrences
"x{5}Xxxx99": 1 occurrence
I've read some things about analyzers, but I don't think that's what I'm looking for.
Edit: To make the question clearer: I don't want to search for a regex pattern, I want to do an aggregate on a regular expression replaced field. For example replace [a-z] with x. Is the best way really to do the regular expression replace outside of elasticsearch?
Based on the formulation of your request, not sure this will match what you are looking for, but assuming you mean to search based on regex ,
following should be what you are looking for:
wildcard and regexp queries
Do take note that the behavior will be different whether the field targeted is analyzed or not.
Typically if you went with the vanilla setup of Elasticsearch as most people to start, your field will likely be analyzed, you can check your the events mapping in your indices to confirm that.
Based on your example and assuming you have a not_analyzed name field:
GET _search
{
"query": {
"regexp": {
"name": "[a-z]{4}"
}
}
}
GET _search
{
"query": {
"regexp": {
"name": "[a-z]{5}[A-Z][a-z]{3}[0-9]{2}"
}
}
}
Based on your update, and a quick search (am not that familiar with aggregations), could be something like the following would match your expectations:
GET _search
{
"size": 0,
"aggs": {
"regmatch": {
"filters": {
"filters": {
"xxxx": {
"regexp": {
"name": "[a-z]{4}"
}
},
"x{5}Xxxx99": {
"regexp": {
"name": "[a-z]{5}[A-Z][a-z]{3}[0-9]{2}"
}
}
}
}
}
}
}
This will give you 3 counts:
- total number of events
- number of first regex match
- number of second regex match

Replace specific part of a string

I'm currently encountering a pickle in modifications of a document. Lets say for example, I have this chunk of text:
"id": "EFM",
"type": "Casual",
"hasBeenAssigned": false,
"hasRandomAssigned": false
},
I currently have roughly 73 - 80 occourances of:
"id" : "somethingdifferent",
Using a regular expression in notepad++, How can I select the entire string:
"id" : "",
but only change the contents between the second set of quotes?
Edit
An oversight made me leave this information out:
"equipedOutfit": {
"id": "MkIV",
"type": "Outfit",
"hasBeenAssigned": false,
"hasRandonAssigned": false
},
"equipedWeapon": {
"id": "EFM",
"type": "Casual",
"hasBeenAssigned": false,
"hasRandonAssigned": false
},
The selected text, looking for is:
"id" : "EFM",
You can use a regex like this:
("id": ").*?"
With a replacement string:
$1whatever"
^^^^^^^^--- replace 'whatever' with whatever you want
Working demo
Update: as you updated your question, I'm updating the answer. If you want only to replace "id": "EFM" then you have just to look for that text only and put the replacement string you want.
"id":\s*"\K[^"]*
You can use \K here and replace by whatever you want.See demo.
https://regex101.com/r/sS2dM8/29
EDIT:
If you want only EFM then use
"id"\s*:\s*"\KEFM(?=")
Find what: ("id"\s?:\s?").*(")
Replace with: \1somethingdifferent\2
Options:
Regular expression, Wrap around

How to wisely combine shingles and edgeNgram to provide flexible full text search?

We have an OData-compliant API that delegates some of its full text search needs to an Elasticsearch cluster.
Since OData expressions can get quite complex, we decided to simply translate them into their equivalent Lucene query syntax and feed it into a query_string query.
We do support some text-related OData filter expressions, such as:
startswith(field,'bla')
endswith(field,'bla')
substringof('bla',field)
name eq 'bla'
The fields we're matching against can be analyzed, not_analyzed or both (i.e. via a multi-field).
The searched text can be a single token (e.g. table), only a part thereof (e.g. tab), or several tokens (e.g. table 1., table 10, etc).
The search must be case-insensitive.
Here are some examples of the behavior we need to support:
startswith(name,'table 1') must match "Table 1", "table 100", "Table 1.5", "table 112 upper level"
endswith(name,'table 1') must match "Room 1, Table 1", "Subtable 1", "table 1", "Jeff table 1"
substringof('table 1',name) must match "Big Table 1 back", "table 1", "Table 1", "Small Table12"
name eq 'table 1' must match "Table 1", "TABLE 1", "table 1"
So basically, we take the user input (i.e. what is passed into the 2nd parameter of startswith/endswith, resp. the 1st parameter of substringof, resp. the right-hand side value of the eq) and try to match it exactly, whether the tokens fully match or only partially.
Right now, we're getting away with a clumsy solution highlighted below which works pretty well, but is far from being ideal.
In our query_string, we match against a not_analyzed field using the Regular Expression syntax. Since the field is not_analyzed and the search must be case-insensitive, we do our own tokenizing while preparing the regular expression to feed into the query in order to come up with something like this, i.e. this is equivalent to the OData filter endswith(name,'table 8') (=> match all documents whose name ends with "table 8")
"query": {
"query_string": {
"query": "name.raw:/.*(T|t)(A|a)(B|b)(L|l)(E|e) 8/",
"lowercase_expanded_terms": false,
"analyze_wildcard": true
}
}
So, even though, this solution works pretty well and the performance is not too bad (which came out as a surprise), we'd like to do it differently and leverage the full power of analyzers in order to shift all this burden at indexing time instead of searching time. However, since reindexing all our data will take weeks, we'd like to first investigate if there's a good combination of token filters and analyzers that would help us achieve the same search requirements enumerated above.
My thinking is that the ideal solution would contain some wise mix of shingles (i.e. several tokens together) and edge-nGram (i.e. to match at the start or end of a token). What I'm not sure of, though, is whether it is possible to make them work together in order to match several tokens, where one of the tokens might not be fully input by the user). For instance, if the indexed name field is "Big Table 123", I need substringof('table 1',name) to match it, so "table" is a fully matched token, while "1" is only a prefix of the next token.
Thanks in advance for sharing your braincells on this one.
UPDATE 1: after testing Andrei's solution
=> Exact match (eq) and startswith work perfectly.
A. endswith glitches
Searching for substringof('table 112', name) yields 107 docs. Searching for a more specific case such as endswith(name, 'table 112') yields 1525 docs, while it should yield less docs (suffix matches should be a subset of substring matches). Checking in more depth I've found some mismatches, such as "Social Club, Table 12" (doesn't contain "112") or "Order 312" (contains neither "table" nor "112"). I guess it's because they end with "12" and that's a valid gram for the token "112", hence the match.
B. substringof glitches
Searching for substringof('table',name) matches "Party table", "Alex on big table" but doesn't match "Table 1", "table 112", etc. Searching for substringof('tabl',name) doesn't match anything
UPDATE 2
It was sort of implied but I forgot to explicitely mention that the solution will have to work with the query_string query, mainly due to the fact that the OData expressions (however complex they might be) will keep getting translated into their Lucene equivalent. I'm aware that we're trading off the power of the Elasticsearch Query DSL with the Lucene's query syntax, which is a bit less powerful and less expressive, but that's something that we can't really change. We're pretty d**n close, though!
UPDATE 3 (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
This is an interesting use case. Here's my take:
{
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer",
"filter": ["lowercase"]
},
"my_edge_ngram_analyzer": {
"tokenizer": "my_edge_ngram_tokenizer",
"filter": ["lowercase"]
},
"my_reverse_edge_ngram_analyzer": {
"tokenizer": "keyword",
"filter" : ["lowercase","reverse","substring","reverse"]
},
"lowercase_keyword": {
"type": "custom",
"filter": ["lowercase"],
"tokenizer": "keyword"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "25"
},
"my_edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
},
"filter": {
"substring": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"test_type": {
"properties": {
"text": {
"type": "string",
"analyzer": "my_ngram_analyzer",
"fields": {
"starts_with": {
"type": "string",
"analyzer": "my_edge_ngram_analyzer"
},
"ends_with": {
"type": "string",
"analyzer": "my_reverse_edge_ngram_analyzer"
},
"exact_case_insensitive_match": {
"type": "string",
"analyzer": "lowercase_keyword"
}
}
}
}
}
}
}
my_ngram_analyzer is used to split every text into small pieces, how large the pieces are depends on your use case. I chose, for testing purposes, 25 chars. lowercase is used since you said case-insensitive. Basically, this is the tokenizer used for substringof('table 1',name). The query is simple:
{
"query": {
"term": {
"text": {
"value": "table 1"
}
}
}
}
my_edge_ngram_analyzer is used to split the text starting from the beginning and this is specifically used for the startswith(name,'table 1') use case. Again, the query is simple:
{
"query": {
"term": {
"text.starts_with": {
"value": "table 1"
}
}
}
}
I found this the most tricky part - the one for endswith(name,'table 1'). For this I defined my_reverse_edge_ngram_analyzer which uses a keyword tokenizer together with lowercase and an edgeNGram filter preceded and followed by a reverse filter. What this tokenizer basically does is to split the text in edgeNGrams but the edge is the end of the text, not the start (like with the regular edgeNGram).
The query:
{
"query": {
"term": {
"text.ends_with": {
"value": "table 1"
}
}
}
}
for the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it
The query:
{
"query": {
"term": {
"text.exact_case_insensitive_match": {
"value": "table 1"
}
}
}
}
Regarding query_string, this changes the solution a bit, because I was counting on term to not analyze the input text and to match it exactly with one of the terms in the index.
But this can be "simulated" with query_string if the appropriate analyzer is specified for it.
The solution would be a set of queries like the following (always use that analyzer, changing only the field name):
{
"query": {
"query_string": {
"query": "text.starts_with:(\"table 1\")",
"analyzer": "lowercase_keyword"
}
}
}