ElasticSearch Regexp Filter - regex

I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match on anything in "info-for/media" in the url field e.g. http://mydomain.co.uk/info-for/media/press-release-1. To try and get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.
POST to localhost:9200/_search
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "regexp": {
          "url": ".*info-for/media.*"
        }
      }
    }
  }
}
This returns 0 hits, but does parse correctly. .*info.* does get results containing the url, but unfortunately is too broad, e.g. matching any urls containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception, or no matches. Can anybody help explain what I'm doing wrong?

First, to the extent possible, avoid regular expressions or wildcards that don't have a prefix. A search for .*foo.* works by matching every single term in the index's dictionary against the pattern, and the matching terms are then combined into an OR-query. This is O(n) in the number of unique terms in your corpus, and the subsequent search is quite expensive as well.
This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/
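To illustrate why a leading prefix matters, here is a minimal Python sketch. It assumes the term dictionary is just a sorted list of strings, which is a simplification (Lucene's real dictionary structure is more sophisticated); the function names are hypothetical.

```python
import bisect
import re

def terms_with_prefix(sorted_terms, prefix):
    # With a known prefix, two binary searches locate the matching range
    # without looking at every term.
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

def terms_matching_regex(sorted_terms, pattern):
    # Without a prefix (e.g. ".*foo.*"), every term has to be tested: O(n).
    rx = re.compile(pattern)
    return [t for t in sorted_terms if rx.fullmatch(t)]

terms = sorted(["bar", "foo", "foobar", "football", "info", "information"])
print(terms_with_prefix(terms, "foo"))
print(terms_matching_regex(terms, ".*foo.*"))
```

Both calls find the same terms here, but the second one had to visit the whole dictionary to do so.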
Secondly, your url is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no info-for/media-term in the dictionary for the regexp to match.
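A rough Python sketch of that effect, assuming a crude tokenizer that lowercases and splits on anything that is not a letter or digit. This is only a stand-in for whatever analyzer your index actually uses, and the helper names are hypothetical:

```python
import re

def crude_tokenize(text):
    # Assumption: lowercase and split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def any_term_matches(terms, pattern):
    # A regexp filter is tested against each indexed term as a whole.
    rx = re.compile(pattern)
    return any(rx.fullmatch(t) for t in terms)

url = "http://mydomain.co.uk/info-for/media/press-release-1"
terms = crude_tokenize(url)
print(terms)
print(any_term_matches(terms, r".*info-for/media.*"))  # no term contains "-" or "/"
print(any_term_matches(terms, r".*info.*"))            # the term "info" matches
```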
What you probably want to do is to index the path and the domain separately, with a path_hierarchy-tokenizer to generate the terms.
Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis
I.e. /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar, /foo, and the domain foo.example.com is tokenized to foo.example.com, example.com, com.
A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
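The token generation described above can be sketched in plain Python. This only mimics the output listed in the answer; the function names are hypothetical and it is not the tokenizer's actual implementation.

```python
def path_hierarchy(path, delimiter="/"):
    # "/foo/bar/baz" -> ["/foo", "/foo/bar", "/foo/bar/baz"]
    parts = [p for p in path.split(delimiter) if p]
    return [delimiter + delimiter.join(parts[: i + 1]) for i in range(len(parts))]

def reverse_domain_hierarchy(domain, delimiter="."):
    # "foo.example.com" -> ["foo.example.com", "example.com", "com"]
    parts = domain.split(delimiter)
    return [delimiter.join(parts[i:]) for i in range(len(parts))]

def is_below(path, ancestor):
    # Equivalent of a term filter: an exact lookup among the generated tokens.
    return ancestor in set(path_hierarchy(path))

print(path_hierarchy("/foo/bar/baz"))
print(reverse_domain_hierarchy("foo.example.com"))
print(is_below("/foo/bar/baz", "/foo/bar"))
```

Because the lookup is an exact term match, /foo/barn/x is not treated as being below /foo/bar, which a naive string-prefix check would get wrong.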

Related

regex - How to find and replace one variable with multiple random variables?

I have a list of URLs all from the same domain. Example :
domain1.com/urlA
domain1.com/urlB
domain1.com/urlC
domain1.com/urlD
................
I want to replace domain1 with multiple random domains (domain2, domain3, etc.).
the results should be something like :
domain2.com/urlA
domain3.com/urlB
domain4.com/urlC
domain2.com/urlD
...............
I'm totally a newbie to regex. I searched 2 hours on the internet and couldn't find a solution to this!
I don't know what language you're using for this problem, but if you must use regexes, the npm package randexp looks like a good and clean solution. It generates a random string that matches a given JavaScript regular expression object.
So, you can use some string manipulation functions to replace "domain1" with the strings generated by randexp like this:
var RandExp = require('randexp');
// returns a String of the format "domain" then a random digit 0-9 at the back
// e.g. "domain0", "domain1"..., "domain9"
console.log(new RandExp(/domain\d/).gen())
If not, I suggest just removing the string "domain1" from the first part of the URL and replacing that with the word "domain" concatenated with a randomly generated digit.
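The non-regex suggestion could look roughly like this in Python. It is a sketch that assumes the URLs have no scheme (as in the question) and that the replacement domains are given as a list; the function and parameter names are hypothetical.

```python
import random

def randomize_domain(urls, old_domain, new_domains, rng=random):
    result = []
    for url in urls:
        # Split "domain1.com/urlA" into host and path at the first slash.
        host, sep, rest = url.partition("/")
        if host == old_domain:
            host = rng.choice(new_domains)  # pick a random replacement domain
        result.append(host + sep + rest)
    return result

urls = ["domain1.com/urlA", "domain1.com/urlB", "domain1.com/urlC", "domain1.com/urlD"]
new_domains = ["domain2.com", "domain3.com", "domain4.com"]
for line in randomize_domain(urls, "domain1.com", new_domains):
    print(line)
```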

Kibana Regular expression search

I am a newbie to ELK. I want to search for docs based on the order in which words occur in a field. For example:
In doc1, my_field: "MY FOO WORD BAR EXAMPLE"
In doc2, my_field: "MY BAR WORD FOO EXAMPLE"
I would like to query in Kibana for docs where "FOO" is followed by "BAR" and not the opposite. So in this case I would like doc1 to be returned, and not doc2.
I tried the query below in Kibana search, but it is not working: it doesn't produce any search results.
my_field.raw:/.*FOO.*BAR.*/
I also tried the analyzed field (just my_field), though I have since learned that should not work. And of course, that didn't produce any results either.
Please help me with this regex search. Why am I not getting any matching result for that query?
I'm not sure offhand why that regex query isn't working, but I believe Kibana uses Elasticsearch's query string query. That means you could run a phrase query by putting your search in double quotes, which looks for the word "foo" followed by "bar". Note that a plain phrase query only matches when the two terms are adjacent. This would also perform better, since you would run it on your analyzed field (my_field), where each word has been tokenized for fast lookups. So your search in Kibana would be:
my_field: "FOO BAR"
Update:
Looks like this is an annoying quirk of Kibana (probably for backwards compatibility reasons). Your query isn't matching because you're searching against a non-analyzed field, and Kibana apparently lowercases the search by default, so it won't match the non-analyzed uppercase "FOO". You can change this in Kibana's advanced settings, specifically by setting the query option "lowercase_expanded_terms" to false.
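The effect of lowercasing can be reproduced with a small Python sketch. It assumes the regexp has to match the whole non-analyzed value and uses Python's re module as a stand-in for the Lucene regexp engine; the helper name is hypothetical.

```python
import re

docs = {
    "doc1": "MY FOO WORD BAR EXAMPLE",
    "doc2": "MY BAR WORD FOO EXAMPLE",
}

def regexp_hits(pattern, docs):
    # The pattern must match the entire stored value, case-sensitively.
    rx = re.compile(pattern)
    return sorted(name for name, value in docs.items() if rx.fullmatch(value))

original = ".*FOO.*BAR.*"
print(regexp_hits(original, docs))          # order-sensitive: only doc1
print(regexp_hits(original.lower(), docs))  # lowercased pattern: no hits
```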
Kibana’s standard query language is based on Lucene query syntax.
And the default analyzer will tokenize the text to different words: [MY, FOO, WORD, BAR, EXAMPLE]
Instead of using regex match, you can try the following search string in Kibana:
my_field: FOO AND my_field: BAR
And if your "my_field" data looks like "MYFOOWORDBAREXAMPLE", which cannot be tokenized, you should use the query string:
my_field: *FOO*BAR*
GET /_search
{
  "query": {
    "regexp": {
      "user": {
        "value": "k.*y",
        "flags": "ALL",
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  }
}
More details are in the Elasticsearch regexp query documentation.

Regex HTTP Response Body Message

I use JMeter for REST testing.
I have made a HTTP Request, and this is the response data:
{"id":11,"name":"value","password":null,"status":"ACTIVE","lastIp":"0.0.0.0","lastLogin":null,"addedDate":1429090984000}
I need just the ID (which is 11) in
{"id":11,....
I use the REGEX below :
([0-9].+?)
It works, but it becomes a problem once my ID has two or more digits. I then need to change the regex to:
([0-9][0-9].+?)
Is there a regex that handles IDs of any length? Thank you for your attention.
Regards,
Stefio
If you want any integer between {"id": and , use the following Regular Expression:
{"id":(\d+),
However, a smarter way of dealing with JSON data could be the JSON Path Extractor (available via JMeter Plugins). Going forward, this option can be much easier to use against complex JSON.
See Using the XPath Extractor in JMeter guide (scroll down to "Parsing JSON") to learn more on syntax and use cases.
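Outside JMeter, the two approaches can be compared with a short Python sketch. Python's json module stands in for the JSON Path Extractor here (an assumption for illustration only), and the response string is the one from the question.

```python
import json
import re

response = (
    '{"id":11,"name":"value","password":null,"status":"ACTIVE",'
    '"lastIp":"0.0.0.0","lastLogin":null,"addedDate":1429090984000}'
)

# Regex approach: one or more digits between {"id": and the next comma.
match = re.search(r'\{"id":(\d+),', response)
id_from_regex = int(match.group(1))

# Parser approach: decode the JSON and read the field directly.
id_from_json = json.loads(response)["id"]

print(id_from_regex, id_from_json)
```

The parser approach keeps working if the key order changes or "id" is no longer the first field, which the regex above does not.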
I suggest using the following regular expression:
"id":([^,]*),
This will first find "id": and then look for anything that is not a comma until it finds a comma. Note the character grouping is only around the value of the ID.
This will work for ANY length ID.
Edit:
The same concept works for almost any JSON data, for example where the value is quoted:
"key":"([^"]*)"
That regular expression will extract the value from given key, as long as value is quoted and does not contain quotes. It first finds "key": and then matches anything that is not a quote until the next quote.
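A quick way to check both patterns is a small Python sketch. The helper names are hypothetical, and the key is passed in as a parameter for convenience.

```python
import re

def extract_unquoted(key, text):
    # "key": followed by anything that is not a comma, up to the next comma.
    m = re.search('"' + re.escape(key) + '":([^,]*),', text)
    return m.group(1) if m else None

def extract_quoted(key, text):
    # "key":" followed by anything that is not a quote, up to the closing quote.
    m = re.search('"' + re.escape(key) + '":"([^"]*)"', text)
    return m.group(1) if m else None

response = '{"id":123456,"name":"value","password":null,"status":"ACTIVE"}'
print(extract_unquoted("id", response))
print(extract_quoted("status", response))
```

Note that the unquoted pattern needs a trailing comma, so it would not match a key that is the last one in the object.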
You can use the quantifier like this:
([0-9]{2,}.+?)
It will catch 2 or more digits, and then any symbol, 1 or more times. If you want to allow no other characters after the digits, use * instead of +:
([0-9]{2,}.*?)

Lucene match only exact query ignoring repeated terms

Given an index where the values of a property 'nodeName' reflect the list below, how can I use Lucene to return only nodes with an exactly matched name?
foo
bar
foobar
foo foo bar
If I search 'bar', I only want the second node returned.
I thought I could use regex in the search term (something like "+nodeName:\"/^{0}$\" where {0} is the query) to match on the start and end of the string, but that's not working - it returns all nodes that include the query.
Also tried an inclusive range ("+nodeName: [{0} TO {0}]") which returned nothing.
Regex query isn't really going to help you here. The regex in your query can not span multiple analyzed terms. The best way to ensure that a match spans the entire contents of a field is to index it in a way that facilitates that, that is, as a single token. I'm assuming this is a TextField using StandardAnalyzer, or something like it. In order to match against the whole input, a StringField would be a good choice, which would index the entire field as one token. Then a simple TermQuery could be used for this sort of search:
new TermQuery(new Term("nodeName", "bar")) would match only the second document, rather than several.
new TermQuery(new Term("nodeName", "foo foo bar")) would match the last example, rather than none at all.
If you also need to be able to perform more standard (full-text) searches against analyzed text in this field, I would recommend indexing the same content in two separate fields, one StringField and one TextField.
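The difference between the two field types can be sketched in Python. Lowercasing plus whitespace splitting is assumed as a simplified stand-in for the analyzer, and the function names are hypothetical.

```python
names = ["foo", "bar", "foobar", "foo foo bar"]

def analyzed_hits(names, query):
    # Analyzed field: each value is split into terms, and a hit only
    # needs one term equal to the query.
    return [n for n in names if query.lower() in n.lower().split()]

def single_token_hits(names, query):
    # Single-token field: the whole value is one term, so only an
    # exact match on the full string counts.
    return [n for n in names if n == query]

print(analyzed_hits(names, "bar"))
print(single_token_hits(names, "bar"))
print(single_token_hits(names, "foo foo bar"))
```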

REGEX - How to ignore some query strings in URLS, but not in others

I need to redirect an old blog URL to a new blog URL. The ID field is the key query string, and everything else in the query string should be ignored. The logic at a high level:
If old case insensitive URL matches: /Blog/Post.aspx? + ID=33 anywhere in the query string of the URL then I will redirect to: /newblog/newurl/
Current REGEX Code: (?i:/Blog/Post.aspx)|(\?)|(?i:id=33)
Success: /Blog/Post.aspx?id=33
Fails: /Blog/Post.aspx?ignore=me&id=33
Fails: /Blog/Post.aspx?ignore=me&id=33&ignoreme=too
How would I have it ignore the potential unknown query string ignore=me and ignoreme=too, but still come up with a REGEX match to redirect when the ID=33 is in the query string?
Thank you for the answer m.buettner!
Right now you would redirect even if you have only ID=33 in your URL, or even if you have only a question mark in there. I suppose that is not what you want. You are probably looking for something like this:
(?i:/Blog/Post\.aspx\?.*id=33(?!\w)).*
That will require /Blog/Post.aspx? and then allow arbitrary characters until the id=33 is encountered.
Depending on which language you are using this in, you could also use a lookahead, which makes it easier to check for different parameters, whose order you might not know:
(?i:/Blog/Post\.aspx\?(?=.*id=33(?!\w))).*
This could then be easily extended to
(?i:/Blog/Post\.aspx\?(?=.*id=33(?!\w))(?=.*another=requirement(?!\w))).*
With the first approach you would have to add two alternatives for both possible orders.
EDIT: A caveat for all three solutions: after the number they require a non-word character (that is, anything but letters, digits or underscores) or the end of the string. This means that they would give false positives in cases like ...id=33+34... and ...id=33%2F.... But these should not be generated by WordPress in the first place.
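The lookahead variant can be tried out with a short Python sketch. It uses Python's re module, which may differ in detail from the regex flavor of your redirect tool, and it escapes the dot in Post.aspx; the function name is hypothetical.

```python
import re

# Lookahead pattern from the answer, with the dot in "Post.aspx" escaped.
PATTERN = re.compile(r"(?i:/Blog/Post\.aspx\?(?=.*id=33(?!\w))).*")

def should_redirect(url):
    return PATTERN.fullmatch(url) is not None

for url in [
    "/Blog/Post.aspx?id=33",
    "/Blog/Post.aspx?ignore=me&id=33",
    "/blog/post.aspx?ignore=me&ID=33&ignoreme=too",
    "/Blog/Post.aspx?id=334",
    "/Blog/Post.aspx?id=33+34",
]:
    print(url, should_redirect(url))
```

The last URL shows the false positive described in the caveat: the + after 33 is a non-word character, so the pattern still matches.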
Oops, I was going to post a general answer for matching arbitrary attributes in a URL! Well, I'm going to leave it here in case you need it.
(?:(id|noignoreme|dontignoreme)=([^&\n]+)(?:\n|&|$))
With this you can add the parameters you want to accept and it will return it as group1 (the option) and group2 (the text of that option).
After that you can check whether id equals 33 and then do one thing, or else do another.
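A minimal Python sketch of that idea, using the pattern above with Python's re module. The function name is hypothetical, and the list of accepted parameter names is the one from the pattern.

```python
import re

PARAM = re.compile(r"(?:(id|noignoreme|dontignoreme)=([^&\n]+)(?:\n|&|$))")

def accepted_params(query_string):
    # findall returns (name, value) tuples for the two capturing groups.
    return dict(PARAM.findall(query_string))

params = accepted_params("ignore=me&id=33&ignoreme=too")
print(params)
if params.get("id") == "33":
    print("redirect to /newblog/newurl/")
```

Since the pattern is not anchored to a parameter boundary, a parameter such as paid=33 would also be picked up as id=33, so you may want to require a preceding ? or & in practice.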