Why can't a wildcard query match `#` in Elasticsearch?

I want to use a wildcard query to search emails in Elasticsearch.
For example:
{
  "query": {
    "wildcard": {
      "email": "*yahoo*"
    }
  }
}
I get all emails containing yahoo. But if I search like this, no documents are returned:
{
  "query": {
    "wildcard": {
      "email": "*#yahoo*"
    }
  }
}
I don't understand why this happens. Can anyone help me?
Thanks in advance!

The Standard Analyzer is the culprit in your case.
The email field in your index seems to be an analyzed string. When you index a value like somemail#yahoo.com, the standard analyzer drops the # and splits it into the two tokens somemail and yahoo.com, and those tokens are what get stored in the inverted index. That's why you were not able to search with #yahoo.
You can use the analyze API to see how your term is getting tokenized:
curl -XGET "http://localhost:9200/_analyze?tokenizer=standard" -d "test#yahoo.com"
You will get the following output:
{
  "tokens": [
    { "token": "test", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 },
    { "token": "yahoo.com", "start_offset": 5, "end_offset": 13, "type": "<ALPHANUM>", "position": 1 }
  ]
}
You can use the uax_url_email tokenizer if you want full email addresses kept as single tokens so that you can search for #yahoo.
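For illustration, here is the same analyze call run with that tokenizer (same pre-5.x curl style as above; note that uax_url_email recognizes standard addresses written with @, so a literal # may still need a custom char filter on top):
curl -XGET "http://localhost:9200/_analyze?tokenizer=uax_url_email" -d "test@yahoo.com"
This keeps the whole address as a single token of type <EMAIL>, so a wildcard such as *@yahoo* can then match it.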
Hope this helps!!


Not able to get desired search results in Elasticsearch search API

I have field "xyz" on which i want to search. The type of the field is keyword. The different values of the field "xyz "are -
a/b/c/d
a/b/c/e
a/b/f/g
a/b/f/h
Now for the following query -
{
  "query": {
    "query_string": {
      "query": "(xyz:(\"a/b/c\"*))"
    }
  }
}
I should only get these two results -
a/b/c/d
a/b/c/e
but I get all four results -
a/b/c/d
a/b/c/e
a/b/f/g
a/b/f/h
Edit -
Actually, I am not querying Elasticsearch directly; I am using this API https://atlas.apache.org/api/v2/resource_DiscoveryREST.html#resource_DiscoveryREST_searchWithParameters_POST which creates the above-mentioned query for Elasticsearch, so I don't have much control over the Elasticsearch query_string. What I can change is the Elasticsearch analyzer for this field or its type.
You'll need to let the query_string parser know you'll be using a regex, so wrap the whole thing in /.../ and escape the forward slashes:
{
  "query": {
    "query_string": {
      "query": "xyz:/(a\\/b\\/c\\/.*)/"
    }
  }
}
Or, you might as well use a regexp query:
{
  "query": {
    "regexp": {
      "xyz": "a/b/c/.*"
    }
  }
}
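Regarding the edit (only the field's analyzer or type can be changed): one option worth exploring is a custom analyzer built on the path_hierarchy tokenizer, which indexes every prefix of the path as a separate token, so an exact match on a/b/c only hits documents whose value starts with that prefix. A sketch, with made-up index and analyzer names (7.x-style typeless mapping):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "xyz": {
        "type": "text",
        "analyzer": "path_analyzer",
        "search_analyzer": "keyword"
      }
    }
  }
}
Indexing a/b/c/d then produces the tokens a, a/b, a/b/c and a/b/c/d, and the keyword search analyzer keeps the query string whole, so a query for xyz:"a/b/c" matches exactly the two expected documents (whether Atlas's generated trailing * behaves identically is worth testing).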

How can users access my Elasticsearch database in my Django SaaS?

Let's say that I have a SaaS with a Django backend that processes users' data and writes everything to Elasticsearch. Now I would like to give users access to search and request their data stored in ES, using all the search requests available in ES. Obviously, a user should only have access to his own data, not to other users' data. I am aware that this can be done in a lot of different ways, but I wonder what is a safe and good solution. At this point I store everything in one index and type in the way shown below, but I can change this in any way.
"_index": "example_index",
"_type": "example_type",
"_id": "H2s-lGsdshEzmewdKtL",
"_score": 1,
"_source": {
"user_id": 1,
"field1": "example1",
"field2": "example2",
"field3": "example3"
}
I think the best way would be to associate every document with a user_id. The user would send, for example, a GET request with a body and an Authorization header containing a token. I would use the token to extract the user's id, for example in this way:
key = request.META.get('HTTP_AUTHORIZATION').split()[1]
user_id = Token.objects.get(key=key).user_id
After this I would redirect his request to ES, and only data that meets the requirements and belongs to this user would be returned. Of course I could do this as shown above, where I also add the user_id field. For example, I could use post_filter: to every request body I would append something like this:
,
"post_filter": {
  "match": {
    "user_id": 1
  }
}
For example, the user sends a GET with the body
{
  "query": {
    "regexp": {
      "tag": ".*example.*"
    }
  }
}
and in my backend I change it and redirect the request to ES with the body:
{
  "query": {
    "regexp": {
      "tag": ".*example.*"
    }
  },
  "post_filter": {
    "match": {
      "user_id": 1
    }
  }
}
but it doesn't seem to me that including this field in _source is a good idea. I am almost sure this can be solved in a more optimal way than post_filtering. I see a lot of information about authorization in ES; however, I can't find how to associate a document with a user_id and then search only his documents without post_filtering. Any ideas?
UPDATE
My current solution looks the way shown below; however, as I mentioned, I believe it is not the optimal way. If anyone has an idea how to solve this in the way described above, I will be grateful for help.
For example, I send:
{
  "query": {
    "regexp": {
      "tag": ".*test.*"
    }
  }
}
In the Django backend I just do:
key = request.META.get('HTTP_AUTHORIZATION').split()[1]
user_id = Token.objects.get(key=key).user_id
body = json.loads(request.body)
body['post_filter'] = {"match": {"user_id": user_id}}
res = es.search(index="pictures", doc_type="picture", body=body)
output = []
for hit in res['hits']['hits']:
    output.append(hit["_source"])
return Response(
    {'output': output},
    status=status.HTTP_200_OK)
In Elasticsearch 7.1, you now have basic security in the free version of Elasticsearch. Thanks to that, you can control your users' access per index.
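Short of those security features, the user_id restriction itself is usually expressed as a filter clause inside a bool query rather than as a post_filter, so it is applied during the search (and can be cached) instead of after scoring. A sketch using the field names from the question:
{
  "query": {
    "bool": {
      "must": [
        { "regexp": { "tag": ".*example.*" } }
      ],
      "filter": [
        { "term": { "user_id": 1 } }
      ]
    }
  }
}
The backend can wrap whatever query body the user sends inside the must clause before forwarding the request to ES.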

Issues with regex in Kibana

I am having a hard time using a regex pattern inside Kibana/Elasticsearch version 6.5.4. The field I am searching for has the following mapping:
"field": {
"type": "text",
"analyzer": "custom_analyzer"
},
Regex searches on this field return several hits when sent straight to Elasticsearch:
GET /my_index/_search
{
  "query": {
    "regexp": {
      "field": "abc[0-9]{4}"
    }
  }
}
On the other hand, in Kibana's discover/dashboard pages all queries below return empty:
original query - field:/abc[0-9]{4}/
escaped query - field:/abc\[0\-9\]\{4\}/
desperate query - field:/.*/
Inspecting the request Kibana sends to Elasticsearch reveals the following query:
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "field:/abc[0-9]{4}/",
"analyze_wildcard": true,
"default_field": "*"
}
}
I expected Kibana to understand the forward-slash syntax /my_query/ and issue a regexp query instead of a query_string. I have tried this with both query languages, "lucene" and "kuery", and with the optional "experimental query features" enabled/disabled.
Digging further, I found this old issue which says that Elasticsearch only runs the regex against the now-deprecated _all field. If this still holds true, I am not sure how regex works in Kibana/Elasticsearch 6.x.
What am I missing? Any help clarifying the conditions for using regex in Kibana would be much appreciated.
All other Stack Overflow questions on this subject are either old or related to syntax issues and/or a misunderstanding of how the analyzer deals with whitespace, and they did not help me.
So I don't exactly have the answer on how to make Lucene regexp search work in Kibana's search bar, but I figured out another way to do it in Kibana.
The solution is to use a Filter with custom DSL.
Here is an example of what to put in the filter's Query JSON -
{
  "regexp": {
    "req.url.keyword": "/question/[0-9]+/answer"
  }
}
An example URL I have in my data - /questions/432142/answer
In addition to this, you can write more filters using the Kibana search bar (Lucene syntax).
It does the appropriate search, with no escaping issues or anything like that.
Hope it helps.
Also, make sure Kibana doesn't have the experimental query features toggle turned on in the top right.

Elasticsearch Update Doc String Replacement

I have some documents in my Elasticsearch. I want to update my documents' contents using a regexp on strings.
For example, I would like to replace all http occurrences with https. Is that possible?
Thank you
This should get you started. Check out the "Update by Query" API here. The API allows you to include the update script and the search query in the same request body.
Regarding your case, an example might look like this...
POST addresses/_update_by_query
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.data.url = ctx._source.data.url.replace('http', 'https')"
  },
  "query": {
    "query_string": {
      "query": "http://*",
      "analyze_wildcard": true
    }
  }
}
Pretty self-explanatory, but script is where we do the update, and query selects the documents to update.
Painless supports regex so you're in luck, look here for some examples, and update the inline value accordingly.
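As a sketch of what a regex-based replacement could look like (assuming the same addresses index and data.url field as above; note that regex literals in Painless require script.painless.regex.enabled: true in elasticsearch.yml):
POST addresses/_update_by_query
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.data.url = ctx._source.data.url.replaceAll(/^http:/, m -> 'https:')"
  },
  "query": {
    "query_string": {
      "query": "http://*",
      "analyze_wildcard": true
    }
  }
}
The ^ anchor restricts the rewrite to the scheme at the start of the URL, so an http occurring later in the string is left untouched.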

Elasticsearch - behavior of regexp query against a non-analyzed field

What is the default behavior of a regexp query against a non-analyzed field? And is the answer the same when dealing with .raw fields?
After everything I've read, I understand the following:
1. Regexp queries will work on analyzed and non-analyzed fields.
2. A regexp query should match across the entire phrase rather than just a single token in non-analyzed fields.
Here's the problem though: I cannot actually get this to work. I've tried it across multiple fields.
The setup I'm working with is a stock ELK install, and I'm dumping pfSense and Snort logs into it with a basic parser. I'm currently on Kibana 4.3 and ES 2.1.
I did a query to look at the mapping for one of the fields, and it indicates it is not_analyzed, yet the regex does not work across the entire field.
"description": {
"type": "string",
"norms": {
"enabled": false
},
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
}
What am I missing here?
If a field is not analyzed, it is indexed as a single token, so a regexp query must match the entire value. It's the same answer when dealing with .raw fields, at least in my experience. Note that in the mapping you posted only description.raw is not_analyzed; the top-level description field is still analyzed, so a regexp against description only matches individual tokens.
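For example, a regexp query targeting the .raw subfield has to match the whole stored value (a sketch; the index name and pattern are made up):
GET /my_index/_search
{
  "query": {
    "regexp": {
      "description.raw": "block.*port 22.*"
    }
  }
}
The same pattern against the analyzed description field would be tested token by token and would likely return nothing.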
You can also use a Groovy script:
matcher = (doc[fieldname].value =~ /${pattern}/)
if (matcher.matches()) {
    matcher.group(matchname)
}
You can pass pattern, matchname and fieldname in params.
What do you mean by "tried it across multiple fields"? If your situation is more complex, maybe you could write a native Java plugin.
UPDATE
{
  "script_fields": {
    "regexp_field": {
      "script": "matcher = (doc[fieldname].value =~ /${pattern}/); if (matcher.matches()) { matcher.group(matchname) }",
      "params": {
        "pattern": "your pattern",
        "matchname": "your match",
        "fieldname": "fields.raw"
      }
    }
  }
}