Extract all subfolders of a path in Elasticsearch - regex

I want to extract all direct subfolders of a path stored in a path field in Elasticsearch.
For example, I want all direct subfolders of this path: /path/to/file
These URLs should match:
/path/to/file/subfolderA
/path/to/file/subfolder-b
/path/to/file/subfolder_c
These URLs should not match:
/path/to/file/subfolderA/folderc
/path/to/file/subfolder-b/folderd/folderE
I tried this regexp query, but it is not working. The part with the / is the problem: when I replace the / with a letter, the query works. I tried to escape the / with a \, but that did not work either.
POST index_name/_search
{
  "query": {
    "regexp": {
      "path_parent": "(/path/to/file/.*)&~(.*/.*)"
    }
  }
}

You may use negated character classes:
"path_parent": "/path/to/file/[^/]*"
Since Elasticsearch regexp patterns are anchored by default, this pattern matches every path that starts with /path/to/file/ and continues with zero or more characters other than / up to the end of the string. The negated character class [^/]* is the part that keeps deeper paths out.
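As a rough local check of the negated-character-class pattern, the sketch below uses Python's re.fullmatch to imitate the implicit anchoring described above. Python's regex engine is not the Lucene engine that Elasticsearch uses, so this is only an approximation, and the helper name is_direct_subfolder is made up for illustration.

```python
import re

# Same pattern as in the answer. fullmatch() stands in for the implicit
# anchoring at both ends (an approximation, not the Lucene engine itself).
DIRECT_SUBFOLDER = re.compile(r"/path/to/file/[^/]*")


def is_direct_subfolder(path: str) -> bool:
    """Return True if path is /path/to/file/ plus one slash-free segment."""
    return DIRECT_SUBFOLDER.fullmatch(path) is not None


# Sample paths taken from the question.
should_match = [
    "/path/to/file/subfolderA",
    "/path/to/file/subfolder-b",
    "/path/to/file/subfolder_c",
]
should_not_match = [
    "/path/to/file/subfolderA/folderc",
    "/path/to/file/subfolder-b/folderd/folderE",
]

for p in should_match + should_not_match:
    print(p, is_direct_subfolder(p))
```

Because [^/]* also accepts an empty segment, /path/to/file/ itself passes this check; switching to [^/]+ would require at least one character after the last slash.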


Regular expression to extract different parts of URL and path

Consider URLs like
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123
I need a regular expression to match these URLs and convert them into
stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas
I need to build a single rule using regex where I can extract paths using $1, $2, etc. without using any javascript logic as I need to use it in a classification rule builder tool.
I tried this URL contains ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? and extracted $4:$5 which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981
But, this is incorrect. Can anyone help me with the correct regex for this?
You may try this:
Regex
/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g
Substitution
$1:$2
(?: - start of a non-capturing group
https?:\/\/ - "http://" or "https://"
([^\/?\s#]+) - capture the domain and put it in group 1
)? - end of the non-capturing group; the ? makes the whole group optional
\/ - "/"
([^\/?\s#]*) - one segment of the URL path, captured in group 2
(?:[\?#].*)? - an optional non-capturing group that consumes a query string or # anchor at the end
Check the test cases
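To see the global substitution at work outside a regex tester, here is a minimal Python sketch. It assumes Python 3.5 or later, where a group that did not participate in a match is replaced by an empty string; the function name to_colon_path is hypothetical.

```python
import re

# The answer's pattern without the JavaScript-style / delimiters.
# re.sub() replaces every match, which plays the role of the g flag.
SEGMENT = re.compile(r"(?:https?://([^/?\s#]+))?/([^/?\s#]*)(?:[?#].*)?")


def to_colon_path(url: str) -> str:
    # \1 is the domain (empty for every match after the first one),
    # \2 is one path segment.
    return SEGMENT.sub(r"\1:\2", url)


print(to_colon_path("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
print(to_colon_path("http://stackoverflow.com/v2/summary/saas?test=123"))
```

Each match consumes one "/segment" pair, so the domain is emitted once and every later match contributes only ":segment".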
Update
If you can't use the g flag for substitution, there is no better way than brute-forcing all the combinations.
For each additional segment of the URL path, add one \/([^\/?#\s]+) to the pattern and one more :$N to the substitution:
https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1
https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2
https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3
https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4
https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5
https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6
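The per-depth rules above all follow one template, so they can be generated rather than typed out. The following Python sketch builds the pattern and substitution for a given number of path segments and tries each depth in turn; build_rule, convert, and the max_segments limit are illustrative names, and Python writes group references as \1 instead of $1.

```python
import re

# Building blocks copied from the patterns above (no need to escape / in Python).
HOST = r"^https?://(?:www\.)?([^/?#\s]+)"
SEGMENT = r"/([^/?#\s]+)"
TAIL = r"/?(?:[#?].*)?$"


def build_rule(n_segments: int):
    """Return (pattern, replacement) for a URL with exactly n_segments path parts."""
    pattern = HOST + SEGMENT * n_segments + TAIL
    # Group 1 is the host, groups 2..n+1 are the path segments.
    replacement = ":".join("\\" + str(i) for i in range(1, n_segments + 2))
    return pattern, replacement


def convert(url: str, max_segments: int = 6):
    """Try each depth in turn; return None if no rule matches."""
    for n in range(max_segments + 1):
        pattern, replacement = build_rule(n)
        if re.match(pattern, url):
            return re.sub(pattern, replacement, url)
    return None


print(convert("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
print(convert("http://stackoverflow.com/v2/summary/saas?test=123"))
```

In a rule-builder tool that cannot loop, the generated patterns would still have to be entered as separate rules, one per depth.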

Firebase redirect using Regex

My goal is to redirect any URL that does not start with a specific symbol ("#") to a different website.
I am using Firebase Hosting and have already tried the regex option in redirects to achieve this. I followed the Firebase documentation on redirects, but because I am new to regular expressions I assume that the mistake is in my regex.
My Goal:
mydomain.com/anyNotStartingWith# => otherdomain.com/anyNotStartingWith#
mydomain.com/#any => mydomain.com/#any
My Code:
{
  "hosting": {
    ...
    "redirects": [
      {
        "regex": "/^[^#]:params*",
        "destination": "otherdomain.com/:params",
        "type": 301
      }
    ],
    ...
  }
}
You can use
"regex": "/(?P<params>[^/#].*)"
The point is that you need a capturing group that will match and capture the part you want to use in the destination. So, in this case
/ - matches /
(?P<params>[^/#].*) - Named capturing group params (you can refer to the group from the destination using :params):
[^/#] - any char other than / and #
.* - any zero or more chars other than line break chars, as many as possible
To avoid matching paths that end in .js, you can use
/(?P<params>[^/#].*(?:[^.].{2}$|.[^j].$|.{2}[^s]$))$
See this RE2 regex demo
See more about how to negate patterns at Regex: match everything but specific pattern.
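Both patterns can be tried out locally with a short Python sketch. Firebase evaluates redirects with RE2, not Python's engine, but the (?P<params>...) named-group syntax and the constructs used here behave the same way in both; the sample paths other than those from the question are made up for illustration.

```python
import re

# The two patterns from the answer, checked with fullmatch() so the whole
# path must match.
BASIC = re.compile(r"/(?P<params>[^/#].*)")
NO_JS = re.compile(r"/(?P<params>[^/#].*(?:[^.].{2}$|.[^j].$|.{2}[^s]$))$")


def captured(pattern, path):
    """Return the text captured as 'params', or None if the path does not match."""
    m = pattern.fullmatch(path)
    return m.group("params") if m else None


print(captured(BASIC, "/anyNotStartingWith"))  # captured for the redirect
print(captured(BASIC, "/#any"))                # None: first char after / is #
print(captured(NO_JS, "/static/app.js"))       # None: last three chars are ".js"
print(captured(NO_JS, "/static/app.css"))      # captured
```

The second pattern works by requiring that the last three characters differ from ".js" in at least one position, which is how a negation is usually spelled in engines without lookahead.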

Regex google analytics - Filter some directories and pages from url

I need to create a view in google analytics that can filter some pages.
The current regex I am using is:
^/(|es|pt|fr|de|blog|examples|about|)?(/(\w*)?)?$
Pages I want to show:
/
/es
/es/
/es/something-else
/blog/blog-post-123
/about
/examples
/examples/directory
/examples/directory/title-examples
Pages I don't want to show:
/any-other-url-not-mentioned
/es1231
/esad
/examplesee
/blogbla-bla-bala
/blog-bla-bla-ba
The current problem with my regex is if the page contains "-" it will NOT show up in the view:
/blog/10-tools
page/this-is-content-url
This one should work:
^\/(es|pt|fr|de|blog|examples|about)?($|\/).*
Regex101
You may use
^\/(es|pt|fr|de|blog|examples|about)?(\/[^\/]+)*\/?$
Or
^\/((es|pt|fr|de|blog|examples|about)(\/[^\/]+)*\/?)?$
See the regex demo and regex demo 2.
Details
^ - start of string
\/ - a /
(es|pt|fr|de|blog|examples|about)? - an optional group matching any of the alternative substrings in it
(\/[^\/]+)* - 0+ / followed with 1+ chars other than /
\/? - an optional /
$ - end of string
The second regex is a bit more precise since it won't match //, although that does not look like a realistic input.
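A quick way to check the second pattern against the pages listed in the question is a small Python sketch like the one below. Google Analytics uses its own regex engine, so this is only a local sanity check; the variable names are illustrative.

```python
import re

# Second pattern from the answer (no need to escape / in Python).
VIEW_FILTER = re.compile(r"^/((es|pt|fr|de|blog|examples|about)(/[^/]+)*/?)?$")

# Pages from the question that should and should not appear in the view.
show = [
    "/", "/es", "/es/", "/es/something-else", "/blog/blog-post-123",
    "/about", "/examples", "/examples/directory",
    "/examples/directory/title-examples", "/blog/10-tools",
]
hide = [
    "/any-other-url-not-mentioned", "/es1231", "/esad",
    "/examplesee", "/blogbla-bla-bala", "/blog-bla-bla-ba",
]

for page in show + hide:
    print(page, bool(VIEW_FILTER.match(page)))
```

Pages containing "-" now pass because [^/]+ accepts any character except a slash, whereas \w in the original expression does not include the hyphen.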

How can I handle virtual subdirectories with regex in app.yaml?

I'd like to point all of my visitors to "single subdirectories" to one page, and all visitors to "double subdirectories" to another. E.g:
/foo/
/new/
/north/
/1-j4/
Would all point to 1.app, whereas
/foo/bar/
/new/york/
/north/west/
/1-j4/a_990/
Would all point to 2.app.
I figured I could do this with non-greedy regex matching, like so:
- url: /(.*?)/$
  script: 1.app
- url: /(.*?)/(.*?)/$
  script: 2.app
To my confusion, both /foo/ and /foo/bar/ resolve to script 1.app. Does the "lazy" regex force itself up to include the middle /, since that's the only way to get a match? How else can I do this? I have tried using (\w*?) but get the same result.
The .*? will still match through any amount of / because . matches any character but a line break char (by default). You need to base your regexps on a negated character class, [^/]*, that matches 0 or more chars other than /.
To match directories with one part, use ^/([^/]*)/?$ and to match those with two, use ^/([^/]*)/([^/]*)/?$ (the leading / is there because the request path itself starts with a slash).
Note that if you plan to use the patterns in online Web testers, you will have to escape / in most of them as by default they use / symbol as a regex delimiter.
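The difference between the lazy dot and the negated character class can be reproduced in a few lines of Python. This sketch assumes request paths start with /, so its patterns carry a leading slash, and it uses + rather than * so that empty segments are rejected; both are choices of the sketch, and the route function is a hypothetical stand-in for the handler lookup in app.yaml.

```python
import re

LAZY_ONE = re.compile(r"/(.*?)/")            # the question's single-directory rule
ONE_PART = re.compile(r"/([^/]+)/?")         # one segment, no slash inside it
TWO_PARTS = re.compile(r"/([^/]+)/([^/]+)/?")


def route(path: str):
    """Pick a script name the way an ordered handler list would."""
    if ONE_PART.fullmatch(path):
        return "1.app"
    if TWO_PARTS.fullmatch(path):
        return "2.app"
    return None


# The lazy group is stretched across the middle slash to make the match succeed.
print(LAZY_ONE.fullmatch("/foo/bar/").group(1))

for p in ["/foo/", "/1-j4/", "/foo/bar/", "/1-j4/a_990/"]:
    print(p, route(p))
```

Laziness only changes which match is tried first; when the shortest candidate fails, the engine keeps extending the group, slashes included, until the overall pattern matches.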
Yes, the (.*?) includes slashes, so will resolve to 1.app. If you put the 2.app handler first, it should do what you want:
- url: /(.*?)/(.*?)/$
  script: 2.app
- url: /(.*?)/$
  script: 1.app

Exclude regular expression match if it contains a string

I'm still learning regular expressions and I seem to be stuck.
I wanted to write a reg exp that matches URL paths like these that contain "bulk":
/bulk-category_one/product
/another-category/bulk-product
to only get the product pages, but not the category pages like:
/bulk-category_one/
/another-category/
So I came up with:
[/].*(bulk).*[/].+|[/].*[/].*(bulk).*
But there's pagination, so when I put the reg exp in Google Analytics, I'm finding stuff like:
/bulk-category/_/showAll/1/
All of them have /_/ in them. I don't want any URL paths that contain /_/, and I can't figure out how to exclude them.
I would go about it this way:
/[^/\s]*bulk[^/]*/[^/\s]+(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*(?!/)
first part:
/ - match the slash
[^/\s]* - match everything that's not a slash and not a whitespace
bulk - match bulk literally
[^/]* - match everything that's not a slash
/ - match the slash
[^/\s]+ - match everything that's not a slash and not a whitespace
(?!/) - ensure there is not a slash afterwards (i.e. url has two parts)
The second part is more of the same, but this time 'bulk' is expected in the second part of the url not the first one.
If you need the word 'product' specifically in the second part of the url one more alternative would be required:
/[^/\s]*bulk[^/]*/[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*product[^/\s]*bulk[^/\s]*(?!/)
If I apply that simple regex to a file FILE
egrep ".*bulk.*product" FILE
which contains your examples above, it only matches the 2 lines with bulk and product. We can, additionally, exclude '/_/':
egrep ".*bulk.*product" FILE | egrep -v "/_/"
Two invocations are often much easier to define and to understand than one big one-size-fits-all expression.
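The same two-step idea (first select, then exclude) can be sketched in Python when grep is not at hand. The function name product_pages is made up, and the last sample path is a hypothetical addition to show the exclusion step doing its work.

```python
import re

# Step 1: the same expression as in the egrep call (search() needs no leading .*).
BULK_PRODUCT = re.compile(r"bulk.*product")


def product_pages(paths):
    """Keep paths that match step 1 and do not contain /_/ (step 2)."""
    return [p for p in paths if BULK_PRODUCT.search(p) and "/_/" not in p]


paths = [
    "/bulk-category_one/product",
    "/another-category/bulk-product",
    "/bulk-category_one/",
    "/another-category/",
    "/bulk-category/_/showAll/1/",
    "/bulk-category/_/product/1/",  # hypothetical: passes step 1, dropped by step 2
]

for p in product_pages(paths):
    print(p)
```

Keeping the exclusion as a plain substring test avoids lookaheads altogether, which helps with engines that do not support them.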