Chrome Extension Multi URL filtering - regex

I am having trouble making my Chrome extension work with multiple URLs. What should be the format for listing URLs to match?
chrome.webNavigation.onDOMContentLoaded.addListener(function (o) {
  chrome.tabs.executeScript(o.id, {
    code: "////////"
  });
}, {
  url: [
    { urlContains: ['/shop/jacket', 'shop/t-shirt'] }
  ]
});
I'm assuming a regex would work, but how would I write that?

Your code doesn't work because urlContains expects a single string only.
The simplest regex operator that matches "A or B" is A|B.
So, in your case { urlMatches : "/shop/jacket|shop/t-shirt" }. It's simple in your case since your URL substrings do not contain special characters; in the general case you may need to \-escape some characters.
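For the general case, the escaping step can be sketched roughly as follows. The helper names escapeRegex and buildUrlFilter are made up for illustration, and the pattern is checked locally with JavaScript's own RegExp, whose syntax is close to, but not identical with, the RE2 syntax that urlMatches uses.

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegex(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build one urlMatches filter from literal substrings.
function buildUrlFilter(substrings) {
  return { urlMatches: substrings.map(escapeRegex).join('|') };
}

const filter = buildUrlFilter(['/shop/jacket', 'shop/t-shirt']);
console.log(filter.urlMatches); // /shop/jacket|shop/t-shirt

// Local sanity check with JavaScript's RegExp (an approximation of RE2).
const re = new RegExp(filter.urlMatches);
console.log(re.test('https://example.com/shop/jacket/42')); // true
console.log(re.test('https://example.com/shop/shoes'));     // false
```

The resulting object could then be placed in the url array of the event filter in place of the urlContains entry.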

Firebase redirect using Regex

My goal is to redirect any URL that does not start with a specific symbol ("#") to a different website.
I am using Firebase Hosting and have already tried the regex option in redirects to achieve this. I followed the Firebase documentation on redirects, but because I am new to regular expressions I assume the mistake is in my regex.
My Goal:
mydomain.com/anyNotStartingWith# => otherdomain.com/anyNotStartingWith#
mydomain.com/#any => mydomain.com/#any
My Code:
{
  "hosting": {
    ...
    "redirects": [
      {
        "regex": "/^[^#]:params*",
        "destination": "otherdomain.com/:params",
        "type": 301
      }
    ],
    ...
  }
}
You can use
"regex": "/(?P<params>[^/#].*)"
The point is that you need a capturing group that will match and capture the part you want to use in the destination. So, in this case:
/ - matches /
(?P<params>[^/#].*) - named capturing group params (you can refer to the group from the destination using :params):
    [^/#] - any char other than / and #
    .* - any zero or more chars other than line break chars, as many as possible
To avoid matching files with .js, you can use
/(?P<params>[^/#].*(?:[^.].{2}$|.[^j].$|.{2}[^s]$))$
See this RE2 regex demo
See more about how to negate patterns at Regex: match everything but specific pattern.
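A quick local check of the first pattern can look like the sketch below. It uses JavaScript's spelling of the named group, (?<params>...), instead of RE2's (?P<params>...). The ^ and $ anchors and the redirectTarget helper are assumptions added only for this check and are not part of the Firebase configuration.

```javascript
// JavaScript spelling of the named group; RE2 writes it as (?P<params>...).
// The anchors are an assumption so the local check treats the path as a whole.
const redirectPattern = /^\/(?<params>[^\/#].*)$/;

// Hypothetical helper: return the redirect target, or null when the path is left alone.
function redirectTarget(path) {
  const m = redirectPattern.exec(path);
  return m ? `otherdomain.com/${m.groups.params}` : null;
}

console.log(redirectTarget('/products/42')); // otherdomain.com/products/42
console.log(redirectTarget('/#any'));        // null
console.log(redirectTarget('/'));            // null
```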

Sonatype NXRM - Asset Name Matcher [duplicate]

I have a link like http://drive.google.com and I want to match "google" out of the link.
I have:
query: {
  bool: {
    must: {
      match: { text: 'google' }
    }
  }
}
But this only matches if the whole text is 'google' (case insensitive, so it also matches Google or GooGlE etc). How do I match for the 'google' inside of another string?
The point is that the ElasticSearch regex you are using requires a full string match:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
Thus, to match any character (but a newline), you can use .* pattern:
match: { text: '.*google.*'}
                ^^      ^^
In ES6+, use regexp instead of match:
"query": {
  "regexp": { "text": ".*google.*" }
}
One more variation is for cases when your string can have newlines: match: { text: '(.|\n)*google(.|\n)*'}. This awful (.|\n)* is a must in ElasticSearch because this regex flavor does not allow any [\s\S] workarounds, nor any DOTALL/Singleline flags. "The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators."
However, if you do not plan to match any complicated patterns and need no word boundary checking, regex search for a mere substring is better performed with a mere wildcard search:
{
  "query": {
    "wildcard": {
      "text": {
        "value": "*google*",
        "boost": 1.0,
        "rewrite": "constant_score"
      }
    }
  }
}
See Wildcard search for more details.
NOTE: The wildcard pattern also needs to match the whole input string, thus
google* finds all strings starting with google
*google* finds all strings containing google
*google finds all strings ending with google
Also, bear in mind the only pair of special characters in wildcard patterns:
?, which matches any single character
*, which can match zero or more characters, including an empty one
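The whole-string behaviour of these patterns can be illustrated with a small sketch that translates a wildcard pattern into an anchored regular expression. The wildcardToRegExp helper is hypothetical and only mimics the matching rules listed above. It says nothing about how Elasticsearch analyzes or stores the field.

```javascript
// Hypothetical helper: translate a wildcard pattern into an anchored RegExp.
function wildcardToRegExp(pattern) {
  const body = Array.from(pattern)
    .map((ch) => {
      if (ch === '*') return '[\\s\\S]*'; // zero or more characters
      if (ch === '?') return '[\\s\\S]';  // exactly one character
      return ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // literal character
    })
    .join('');
  return new RegExp(`^${body}$`); // the pattern must cover the whole input
}

console.log(wildcardToRegExp('google*').test('google.com'));        // true
console.log(wildcardToRegExp('google*').test('drive.google.com'));  // false
console.log(wildcardToRegExp('*google*').test('drive.google.com')); // true
console.log(wildcardToRegExp('*google').test('drive.google.com'));  // false
console.log(wildcardToRegExp('g??gle').test('google'));             // true
```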
Use a wildcard query:
'{"query":{ "wildcard": { "text.keyword" : "*google*" }}}'
For both partial and full-text matching, the following worked:
"query" : {
  "query_string" : {
    "query" : "*searchText*",
    "fields" : [
      "fieldName"
    ]
  }
}
I can't find a breaking change disabling regular expressions in match, but match: { text: '.*google.*'} does not work on any of my Elasticsearch 6.2 clusters. Perhaps it is configurable?
Regexp works:
"query": {
  "regexp": { "text": ".*google.*" }
}
For partial matching you can either use prefix or match_phrase_prefix.
For a more generic solution you can look into using a different analyzer or defining your own. I am assuming you are using the standard analyzer which would split http://drive.google.com into the tokens "http" and "drive.google.com". This is why the search for just google isn't working because it is trying to compare it to the full "drive.google.com".
If you instead indexed your documents using the simple analyzer, it would split the URL into "http", "drive", "google", and "com". This will allow you to match any one of those terms on its own.
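To see why that helps, here is a rough approximation of what the simple analyzer does: lowercase the text and split it wherever a non-letter appears. The simpleAnalyze function is a hypothetical stand-in that handles ASCII letters only, not the real analyzer.

```javascript
// Rough approximation of the simple analyzer: lowercase, then split on non-letters.
// Only ASCII letters are handled here; the real analyzer is Unicode-aware.
function simpleAnalyze(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter((t) => t.length > 0);
}

const tokens = simpleAnalyze('http://drive.google.com');
console.log(tokens);                    // [ 'http', 'drive', 'google', 'com' ]
console.log(tokens.includes('google')); // true
```

With "google" stored as its own term, a plain match query for google can find the document.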
Using the Node.js client:
tag_name is the field name, value is the incoming search value.
const { body } = await elasticWrapper.client.search({
  index: ElasticIndexs.Tags,
  body: {
    query: {
      wildcard: {
        tag_name: {
          value: `*${value}*`,
          boost: 1.0,
          rewrite: 'constant_score',
        },
      },
    },
  },
});
You're looking for a wildcard search. According to the official documentation, it can be done as follows:
query_string: {
  query: `*${keyword}*`,
  fields: ["fieldOne", "fieldTwo"],
},
Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters: qu?ck bro*
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-wildcard
Be careful, though:
Be aware that wildcard queries can use an enormous amount of memory and perform very badly — just think how many terms need to be queried to match the query string "a* b* c*".
Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false.

Query document based on field's value containing backslash using regex

I'm trying to query a DB with documents similar to the one presented below.
{
  "_id": "5b9bd1b947c7471038399a39",
  "subdir": "ge\\pt02\\kr02_20180824\\kr02_2018091log\\0010796ab5"
}
How can I filter all documents starting with ge\\pt02\\kr02?
I tried many different approaches, for example:
{"subdir": {"$regex": "pt02\\kr02*"}}
but I cannot figure out how to prepare a correct filter.
The problem is that you need to escape the backslashes.
Here is a working example:
db.test1.insert({"subdir":"ge\\pt02\\kr02_20180824\\kr02_2018091log\\0010796ab5"})
db.test1.find({"subdir": { $regex: "^ge\\\\pt02\\\\kr02"}})
This prints out:
{ "_id" : ObjectId("5ba28194fbb45cb9f7c58b18"), "subdir" : "ge\\pt02\\kr02_20180824\\kr02_2018091log\\0010796ab5" }
We need to escape the backslash there. Also, since you want to select only the documents starting with this pattern, you need to anchor the regex with a caret; wrapping the pattern in a group, as done here, is optional. This gives us the following regex:
let pattern = "^(ge\\\\pt02\\\\kr02)";
{"subdir": {"$regex": pattern}}
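The two levels of escaping can be checked outside the database. The sketch below uses JavaScript's RegExp as a stand-in for the regex engine MongoDB uses, on the assumption that both treat an escaped backslash the same way. The variable names are illustrative.

```javascript
// The stored value contains single backslashes; "\\" in source code is one backslash.
const subdir = 'ge\\pt02\\kr02_20180824\\kr02_2018091log\\0010796ab5';

// Four backslashes in source -> two in the pattern string -> one literal backslash matched.
const pattern = '^ge\\\\pt02\\\\kr02';

console.log(subdir.startsWith('ge\\pt02\\kr02'));  // true
console.log(pattern);                              // ^ge\\pt02\\kr02
console.log(new RegExp(pattern).test(subdir));     // true
```

The string literal layer removes one level of backslashes and the regex layer removes the other, which is why the source needs four for each literal backslash in the data.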

logstash grok filter regular expression works in debug tool but failed in actual execution

I'm trying to extract a field out of a log line. I use http://grokdebug.herokuapp.com/ to debug my regular expression:
(?<action>(?<=action=).*(?=\&))
with input text like this:
/event?id=123&action={"power":"on"}&package=1
I was able to get a result like this:
{
  "action": [
    "{"power":"on"}"
  ]
}
But when I copy this into my Logstash config file:
input { stdin {} }
filter {
  grok {
    match => { "message" => "(?<action>(?<=action=).*(?=\&))" }
  }
}
output {
  stdout {
    codec => 'json'
  }
}
the output says matching failed:
{"message":" /event?id=123&action={\"power\":\"on\"}&package=1","@version":"1","@timestamp":"2016-01-05T10:30:04.714Z","host":"xxx","tags":["_grokparsefailure"]}
I'm using logstash-2.1.1 in Cygwin.
Any idea why this happens?
You might experience an issue caused by a greedy dot matching subpattern .*. Since you are only interested in a string of text after action= till next & or end of string you'd better use a negated character class [^&].
So, use
[?&]action=(?<action>[^&]*)
The [?&] matches either a ? or & and works as a boundary here.
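As a quick check outside Logstash, the same pattern can be run against the sample line with JavaScript's RegExp, which accepts the same (?<name>...) named-group syntax. The variable names are illustrative, and this says nothing about how grok itself handles the pattern.

```javascript
const line = '/event?id=123&action={"power":"on"}&package=1';

// Same pattern as above; [^&]* stops at the next & or at the end of the string.
const actionPattern = /[?&]action=(?<action>[^&]*)/;

const match = actionPattern.exec(line);
console.log(match.groups.action); // {"power":"on"}

// The value is also captured when action is the last parameter.
const last = actionPattern.exec('/event?id=123&action=off');
console.log(last.groups.action);  // off
```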
It doesn't answer your regexp question, but...
Parse the query string to a separate field and use the kv{} filter on it.
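The idea behind that suggestion is to split the query string into key/value pairs instead of writing one regex per parameter. A minimal sketch of that splitting is shown below. parseQuery is a hypothetical helper that only illustrates the approach and is not the kv filter's actual implementation.

```javascript
// Hypothetical helper: split "a=1&b=2" into an object of key/value pairs.
function parseQuery(requestPath) {
  const queryStart = requestPath.indexOf('?');
  const query = queryStart === -1 ? '' : requestPath.slice(queryStart + 1);
  const result = {};
  for (const pair of query.split('&')) {
    if (pair === '') continue;                      // skip empty segments
    const eq = pair.indexOf('=');
    const key = eq === -1 ? pair : pair.slice(0, eq);
    result[key] = eq === -1 ? '' : pair.slice(eq + 1);
  }
  return result;
}

const parsed = parseQuery('/event?id=123&action={"power":"on"}&package=1');
console.log(parsed.action);  // {"power":"on"}
console.log(parsed.package); // 1
```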

ElasticSearch Regexp Filter

I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match on anything in "info-for/media" in the url field e.g. http://mydomain.co.uk/info-for/media/press-release-1. To try and get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.
POST to localhost:9200/_search
{
  "query" : {
    "match_all" : { },
    "filtered" : {
      "filter" : {
        "regexp": {
          "url": ".*info-for/media.*"
        }
      }
    }
  },
}
This returns 0 hits, but does parse correctly. .*info.* does get results containing the url, but unfortunately is too broad, e.g. matching any urls containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception, or no matches. Can anybody help explain what I'm doing wrong?
First, to the extent possible, try to never use regular expressions or wildcards that don't have a prefix. The way a search for .*foo.* is done, is that every single term in the index's dictionary is matched against the pattern, which in turn is constructed into an OR-query of the matching terms. This is O(n) in the number of unique terms in your corpus, with a subsequent search that is quite expensive as well.
This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/
Secondly, your url is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no info-for/media-term in the dictionary for the regexp to match.
What you probably want to do is to index the path and the domain separately, with a path_hierarchy-tokenizer to generate the terms.
Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis
I.e. /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar, /foo and the domain foo.example.com is tokenized to foo.example.com, example.com, com
A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
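The token generation described above can be sketched in a few lines. pathTokens and domainTokens are hypothetical helpers that mimic the listed output. The order in which a real tokenizer emits the tokens may differ, and the Set stands in for the index's term dictionary.

```javascript
// Hypothetical helper: every ancestor prefix of a path, e.g. /foo, /foo/bar, /foo/bar/baz.
function pathTokens(path) {
  const parts = path.split('/').filter((p) => p.length > 0);
  return parts.map((_, i) => '/' + parts.slice(0, i + 1).join('/'));
}

// Hypothetical helper: every parent domain, e.g. foo.example.com, example.com, com.
function domainTokens(domain) {
  const labels = domain.split('.');
  return labels.map((_, i) => labels.slice(i).join('.'));
}

// Indexing stores the tokens; a term filter is then a plain membership test.
const indexed = new Set(pathTokens('/foo/bar/baz'));
console.log([...indexed]);                    // [ '/foo', '/foo/bar', '/foo/bar/baz' ]
console.log(indexed.has('/foo/bar'));         // true
console.log(indexed.has('/foo/ba'));          // false
console.log(domainTokens('foo.example.com')); // [ 'foo.example.com', 'example.com', 'com' ]
```

Because the lookup is an exact term match, no term in the dictionary has to be scanned against a pattern at query time.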