I need to extract ONLY the file names from any URL. I looked at all the previous answers on Stack Overflow regarding URLs and file names, but none of them considered the case of a file name with escaped characters.
For example, I have a URL like this:
https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
I tried many regexes and finally found one that does not split the file names when it encounters an escaped character:
"(?:\w*:\/\/)?((?:[\w-_]*\.?)+:?\d*(?:\/?[\w-_.]+\/?)*)[\?]?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?"g
You can test it here: https://regex101.com/r/LRWlif/7
The results are a mess:
match,group,is_participating,start,end,content
1,0,yes,0,148,https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
1,1,yes,8,60,content.com/pbpython.py/notebooks/thirsty-allies.mov
1,2,yes,61,65,file
1,3,yes,66,94,The%20Big%20Kahuna.webm.tar.gz
1,4,yes,95,96,f
1,5,yes,97,123,Crosstab%20Explained.ipynb
1,6,yes,124,125,a
1,7,yes,126,127,b
1,8,yes,128,129,m
1,9,yes,130,148,plok%202001.tar.gz
2,0,yes,148,148,
2,1,yes,148,148,
2,2,yes,148,148,
2,3,yes,148,148,
2,4,yes,148,148,
2,5,yes,148,148,
2,6,yes,148,148,
2,7,yes,148,148,
2,8,yes,148,148,
2,9,yes,148,148,
The only good thing is that the file names are all matched somehow, with no split parts, with the exception of "thirsty-allies.mov", which is matched along with some URL parts.
There is also the issue that not all escaped characters can be part of a file name. %2F, for example, is the "/" that separates folders in paths, and should not be considered part of the match.
For example:
https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
With the same RegEx we get this result:
match,group,is_participating,start,end,content
1,0,yes,0,288,https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
1,1,yes,8,56,www.contoso.com/sites/marketing/documents/Shared
1,2,yes,56,56,
1,3,yes,56,99,%20Documents/Forms/AllItemA.aspx?RootFolder
1,4,yes,99,99,
1,5,yes,100,188,%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
1,6,yes,189,199,FolderCTID
1,7,yes,200,240,0x012000F2A09653197F4F4F919923797C42ADEC
1,8,yes,241,245,View
1,9,yes,246,288,%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
2,0,yes,288,288,
2,1,yes,288,288,
2,2,yes,288,288,
2,3,yes,288,288,
2,4,yes,288,288,
2,5,yes,288,288,
2,6,yes,288,288,
2,7,yes,288,288,
2,8,yes,288,288,
2,9,yes,288,288,
As you can see, the filename to match is:
PFProduct%20Promotion%202001.docx
but the RegEx matched:
%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
How can I get just the filenames and nothing else?
No language is tagged, but if you know that you always have URLs, you might use
(?<=[=\/]|%2F)(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
Explanation
(?<= Positive lookbehind, assert what is to the left of the current position is
[=\/] Match either = or /
| Or
%2F Match literally
) Close the lookbehind
(?: Non capture group
(?!%2F)[^?&\s\/] Match 1 char other than what is listed in the character class if %2F is not directly to the right of the current position
)+ Close the non capture group and repeat 1+ times
\.\w+ Match a dot and 1 or more word characters
(?=[?&]|$) Positive lookahead, assert either ? or & or the end of the string directly to the right of the current position
Regex demo
Other variations
Or with a capture group, if the engine does not support a lookbehind whose alternatives differ in width:
(?:[=\/]|%2F)((?:(?!%2F)[^?&\s\/])+\.\w+)(?=[?&]|$)
Regex demo
In languages where an infinite quantifier in the lookbehind is supported:
(?<=https?:\/\/\S*(?:[=\/]|%2F))(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
Regex demo
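For anyone trying this in Python: as far as I know, the built-in re module only accepts fixed-width lookbehinds, so the first pattern (whose lookbehind alternatives are 1 and 3 characters wide) is rejected there, while the capture-group variation works. Below is a minimal sketch, assuming Python 3. The helper name extract_filenames and the optional urllib.parse.unquote decoding step are my own additions for illustration, not part of the patterns above.

```python
import re
from urllib.parse import unquote

# Capture-group variation from above: group 1 holds the file name.
FILENAME_RE = re.compile(r"(?:[=\/]|%2F)((?:(?!%2F)[^?&\s\/])+\.\w+)(?=[?&]|$)")


def extract_filenames(url, decode=False):
    """Return the file names found in url (hypothetical helper)."""
    names = FILENAME_RE.findall(url)  # one capture group -> list of strings
    # Optionally turn %20 and friends back into plain characters.
    return [unquote(n) for n in names] if decode else names


url1 = ("https://content.com/pbpython.py/notebooks/thirsty-allies.mov"
        "?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb"
        "&a=b&m=plok%202001.tar.gz")
url2 = ("https://www.contoso.com/sites/marketing/documents/Shared%20Documents"
        "/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments"
        "%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx"
        "&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC"
        "&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D")

print(extract_filenames(url1))
# ['thirsty-allies.mov', 'The%20Big%20Kahuna.webm.tar.gz',
#  'Crosstab%20Explained.ipynb', 'plok%202001.tar.gz']
print(extract_filenames(url2, decode=True))
# ['AllItemA.aspx', 'PFProduct Promotion 2001.docx']
```

Note that path segments such as pbpython.py are skipped because they are followed by "/" rather than "?", "&" or the end of the string.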
I want to extract every occurrence of match-this that is not enclosed by tildes (~) on both sides in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the ~match-this~ occurrences, but when I tried negating it with (?!(~match-this)), it produced only empty matches.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match above). When I added an asterisk to either negated tilde class, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes.
Is it possible to achieve this with a single regex, or does it need more than one?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by match-this
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match-this
) Close group 1
See a regex demo and a Python demo.
Example
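import re

# The first alternative consumes the ~...~ runs to skip; group 1 captures what to keep.
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
# findall returns group 1 for each match; it is empty when the first alternative matched.
res = [m for m in re.findall(pattern, s) if m]
print(res)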
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
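The discard-or-capture idea can be checked quickly in Python. This is only a sketch using the sample string from the question; the variable names are mine.

```python
import re

# Alternative 1 swallows the tilde-enclosed occurrences;
# alternative 2 captures everything else in group 1.
PATTERN = re.compile(r"~match-this~|(\bmatch-this\b)")

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# m.group(1) is None whenever the first (discard) alternative matched.
kept = [(m.start(1), m.group(1)) for m in PATTERN.finditer(s)
        if m.group(1) is not None]
print(kept)
# [(0, 'match-this'), (23, 'match-this'), (35, 'match-this'),
#  (46, 'match-this'), (68, 'match-this')]
```

The occurrences starting at offsets 11 and 57 are the ones enclosed by tildes, and they are consumed by the first alternative without being captured.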
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead end again anyway, because the character is still consumed.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
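The consuming behaviour described in the points above can be seen in a small Python sketch. The second test string t is a made-up example of mine, chosen so that two occurrences share a single separator character.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# The negated class needs a real character on each side and consumes it,
# so occurrences at the very start or end of the text can never match.
consuming = re.findall(r"[^~]match-this[^~]", s)
print(consuming)  # [' match-this ']

# Hypothetical input: the middle "x" is consumed by the first match,
# so the second occurrence has no leading character left to match.
t = "xmatch-thisxmatch-thisx"
with_class = re.findall(r"[^~]match-this[^~]", t)
with_lookarounds = re.findall(r"(?<!~)match-this(?!~)", t)
print(len(with_class), len(with_lookarounds))  # 1 2
```

Lookarounds only inspect the neighbouring characters, which is why the second pattern still finds both occurrences in t.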
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
There is a difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
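As a rough illustration of that last point, here is a Python sketch that runs the three-way lookaround pattern and then reads the neighbouring characters from the input using the match positions. The function name matches_with_neighbours is hypothetical, and returning an empty string at the text edges is just my convention.

```python
import re

PATTERN = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)


def matches_with_neighbours(text):
    """Return (start, char_before, char_after) per match; '' at the text edges."""
    found = []
    for m in PATTERN.finditer(text):
        # Guard the edges: in Python, text[-1] would wrap around to the last char.
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        found.append((m.start(), before, after))
    return found


s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
for item in matches_with_neighbours(s):
    print(item)
# (0, '', '~')
# (23, ' ', ' ')
# (35, '~', '#')
# (46, '#', '~')
# (68, '~', '')
```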
I have <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] to catch everything inside
<autorpodpis>_this_is_an_example_of_what_I'd_like_to_match<
If there is a space, a comma (,) or a semicolon (;), or a space before a comma or a semicolon, my RegEx catches everything up to, but not including, these characters (see my link). It works as expected.
Overall, the RegEx works fine with the substitution \1 (in AutoHotkey I use $1), but I'd like to match without using substitution.
You seem to mix up the terms substitution (a regex-based replacement operation) and capturing (storing the part of the matched value that corresponds to a part of the pattern enclosed in a pair of unescaped parentheses, in a numbered or named group).
If you want to just match a substring in specific context without capturing any subvalues, you might consider using lookarounds (lookbehind or lookahead).
In your case, since you need to match a string after some known string, you need a lookbehind. A lookbehind tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.
So, you could use
pos := RegExMatch(input, "(?<=<autorpodpis>)\p{L}+(?:\s+\p{L}+)*", Res)
So, the Res should have WOJCIECH ZAŁUSKA if you supply <autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis> as input.
Explanation:
(?<=<autorpodpis>) - check if there is <autorpodpis> right before the currently tested location. If there is none, fail this match, go on to the next location in string
\p{L}+ - 1+ Unicode letters
(?:\s+\p{L}+)* - 0+ sequences of 1+ whitespaces followed with 1+ Unicode letters.
However, in most cases, and always in cases like this one, where the pattern in the lookbehind is known, the lookbehind is unanchored (say, it is the first subpattern in the pattern) and you do not need overlapping matches, use capturing.
The version with capturing in place:
pos := RegExMatch(input, "<autorpodpis>(\p{L}+(?:\s+\p{L}+)*)", Res)
And then Res[1] will hold the WOJCIECH ZAŁUSKA value. Capturing is in most cases (96%) faster.
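The same two approaches can be compared outside AutoHotkey. Here is a rough Python sketch, with one loud assumption: Python's built-in re module does not support \p{L}, so [^\W\d_] stands in as an approximation of "any Unicode letter". The variable names are mine.

```python
import re

# ASSUMPTION: [^\W\d_] approximates \p{L}, which Python's re does not support.
LETTERS = r"[^\W\d_]+(?:\s+[^\W\d_]+)*"
text = "<autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis>"

# Lookbehind version: the whole match is the name, the tag is only checked.
m = re.search(r"(?<=<autorpodpis>)" + LETTERS, text)
whole = m.group(0)

# Capturing version: the tag is part of the match, the name sits in group 1.
m2 = re.search(r"<autorpodpis>(" + LETTERS + r")", text)
captured = m2.group(1)

print(whole, captured, sep=" | ")  # WOJCIECH ZAŁUSKA | WOJCIECH ZAŁUSKA
```

The difference shows up in the match positions: the lookbehind match starts right after the tag, while the capturing match starts at the tag itself.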
Now, your regex - <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] - is not efficient, as [^;,<\n\r] also matches \s, and \s also matches part of [;,<\n\r] (namely \n and \r). My regex is linear: no subpattern can match the same text as the one before it.