Regex on picking S3 bucket name from S3 URI - regex

I have been trying to regex pattern to obtain S3 bucket name from S3 URI but have no luck.
example: s3://example-bucket/file-name.filetype
Closest I could get with this: \/\/([^\/].*[^\/])\/ but i'm not sure how to negate the slash from the result

The part ([^/]+) is looking for a sequence of of characters that are not slash.
keeping close with what you had you could write //([^/]+)/ but this is the same as
//([^/]+)
optional you could use lookbehind (?<=//) and/or lookahead (?=/)
(?<=//)([^/]+)(?=/)
(depending on your use cases a couple of different lookahead expressions are possible)
'lookaround' especially 'lookbehind' is not supported in all regexp dialects. e.g. not in JavaScript
Debuggex Demo

If you don't have the lookbehind option (javascript regex), and your URL structure is consistent enough, then perhaps this pattern will be useful:
[^\/]+(?=\/[^\/])
In English: "match one or more non-forward slashes, followed (but not actually matched, via the lookahead) by one forward slash and a non-forward slash".
(your other option would be to access your match groups in order to get rid of the slashes that you matched)
https://regex101.com/r/XM05Sw/2/

Related

Can I use negative lookahead and other conditions together in regex group?

I'm trying to match some URLs against another table using regex and - because the original source wasn't put together properly, I'm using a regex to clean them within the SQL.
As an example, the URLs might be /this-is-my-test-string/ or /this-is-my-test-string and the reference table is always of the form /this-is-my-test-string so using this regex works well to capture the matching part.
(\/[^\/)]*)\/?
However I've now come across some others with the form /this-is-my-test-string- and /this-is-my-test-string-/ which aren't as straightforward - I can't just add - to the exclusion as it's present in the rest of the string. From reading around - regex is not something I use regularly - a lookahead would seem to be the answer, but I can't work out how to include this in the expression.
Any help would be gratefully received.
You can use $ to anchor the end of the string, and use a non-greedy quantifier *? on the non-slash character set to allow -? to match a - from (or near) the end of the string:
(\/[^\/)]*?)-?\/?$

Confusion regarding regex pattern

I have tried to write a regex to catch certains words in a sentence but it is not working. The below regex is only working when I give a exact match.
[\s]*((delete)|(exec)|(drop\s*table)|(insert)|(shutdown)|(update)|(\bor\b))
Lets say I send a HTTP Header - headerName = insert it works,
but does not work when I give headerName = awesome insert number
--edit--
#user1180, Yes I can use prepared statements, but we are also looking into the regex part.
#Marcel and Wiktor, yes it is working in that website. I guess my tool is not recognizing the regex. I am using Mulesoft ESB, which uses Matches when the evaluated value fits a given regular expression (regex), specifically a regex "flavor" supported by Java.
It is using something like this,
matches /\+(\d+)\s\((\d+)\)\s(\d+\-\d+)/ and I am not aware of how to write my usecase in this regex format.
My usecase is too catch SQL injection pattern, which would check the request header/queryparam for delete (exec)(drop\s*table)(insert)(shutdown)(update)or parameters.
Since your regex must match the whole input you need to wrap the pattern with .*, something similar to (?s).*(<YOUR PATTERN>).*.
Use
(?s).*\b(delete|exec|drop\s+table|insert|shutdown|update|or)\b.*
Details
(?s) - turns on DOTALL mode where . matches any char
.* - any 0+ chars, as many as possible
\b(delete|exec|drop\s+table|insert|shutdown|update|or)\b - any one of the whole words (note \b is a word boundary construct) in the group
.* - any 0+ chars, as many as possible
I also replaced drop\s*table with drop\s+table since I guess droptable is not expected.

Match a url that does not contain certain word

I need some help for a regular expressions for not matching urls like these one:
/Common/Download.php?file=/path/to/file.pdf
and instead to matching these static urls:
/path/to/file.pdf
I have read many post (also in this site) but nothing seems to works as expected.
Thanks for your helps.
Lorenzo.
UPDATE
Sorry if this post is not so complete. I post more information to obtain a better help.
The regular expression that I need must work with Apache module mod_rewrite (and also with the module mod_rewrite of IIS (maybe this is not the right name) that is compatible with the module of Apache (as from my knowledge), if possible ) and must redirect the matching static urls (only of the second type, as from my post) to a specific page.
Thanks again.
Lorenzo.
Without knowing more about your programming language and regex parser, I'm keeping my regex really generic, but something like this should get you close:
^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$
This matches a string starting with a slash, one or more directories separated by slashes, and ending with a filename with a three or four character file extension.
This means /path/to/some/really/buried/file.html would match too.
Using an interactive regular expression evaluator is a great way to rapidly write and debug regular expressions, especially if you are new to them. I really like The Regex Coach for this.
Another option could be to repeat the forward slash lowercase characters pattern in a non capturing group and repeat that. Then match the file extension .pdf
^(?:/[a-z]+){3}\.pdf$
Explanation
From the beginning of the string ^
Non capturing group (?:
Match one or more lowercase characters [a-z]+
Close the non capturing group and match 3 times ){3}
Match a dot \. and pdf
The end of the string $
Or repeat the group 2 times and for the filename use \w+
^(?:/[a-z]+){2}/\w+\.pdf$
If you want to match your example static url and maybe longer or shorter paths like /path/file.pdf or /dir/path/to/file.pdf you could for example use:
^(?:/\w+)+\.\w+$

Clean and extract Subdomains & Domains from URLs using Regex Notepad++

This is simple text file.
The URL:
Can have https:// or http://
Eliminate both as well as trailing url/ file paths
Extract only domains and/or subdomains
I have Notepad++ and EditPlus
open to other Suggestions?
Examples:
https://appspace.com
http://appspace.com/
http://ayurfit.ning.com/main/authorization/signIn
http://bangalore.olx.in/login.php
http://birthdayshoes.com/forum/index.php
http://birthdayshoes.com/forum/register/
http://forums.virtualbox.org/ucp.php
Tries:
/(?!.{253})((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.){1,126}+[A-Za-z]{2,6}/
^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$
https://regex101.com/r/hZ4cL4/4
Tried many on other machine as examples from Regex101
Found this little nugget as well. I'll post how its different once I understand it.
Regular Expression - Extract subdomain & domain
For the links that start with protocol, you can use the following regex:
(?<=://)[\w-]+(?:\.[\w-]+)+\b
See demo
The (?<=://) look-behind makes sure there is :// before the value we want to match, and the whole matched text consists of sequences of 1 or more word characters or hyphens ([\w-]+) that are eventually separated with periods.
You could simply extract anything that is between two . Additionally
you could use lookbehinds for http(s) and lookahead for the filepath
to fine tune your results.

Google Analytics Regex - Alternative to no negative lookahead

Google Analytics does not allow negative lookahead anymore within its filters. This is proving to be very difficult to create a custom report only including the links I would like it to include.
The regex that includes negative lookahead that would work if it was enabled is:
test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
This matches:
test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23
test.com/?ref=23&e=35
and does not match (as it should):
test.com/ambassadors
test.com/admin/?signup=true
test.com/randomtext/
I am looking to find out how to adapt my regex to still hold the same matches but without the use of negative lookahead.
Thank you!
Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.
That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have have had this problem; you should have been using $ all along.
However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:
test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)
...which I'm pretty sure you don't want. :P
Try this regex:
test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$
or more readably:
test\.com
(?:
/
(?:index_\w+\.php)?
(?:
\?ref=\d+
(?:
&e=\d+
)?
)?
)?
\s*$
For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) what parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:
^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
Firstly I think your regex needs some fixing. Let's look at what you have:
test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
The case where you use the optional ? at the start of index... is already taken care of by the second alternative:
test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
Now you probably only want the first (.*) to be allowed, if there actually was a literal ? before. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:
test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:
test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now the first and second and third option can be collapsed into one, if we make the file name optional, too:
test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)
Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)
Seeing your input examples, this seems to be closer to what you actually want to match.
Then to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, it asserts that you have reached the end of the string. In this case you can simply replace it by \Z, which always matches the end of the string. If you do not, then it asserts that you have reached the end of a line. In this case you can use $ but you need to also use multi-line mode, so that $ matches line-endings, too.
So, if you use singleline mode (which probably means you have only one URL per string), use this:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z
If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$