I need to extract ONLY the file names from any URL. I looked at all previous answers on Stack Overflow regarding URLs and file names, but none of them considers file names with escaped characters.
For example, I have a URL like this:
https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
I tried many regexes and finally found one that does not split the file names when it encounters an escaped character:
"(?:\w*:\/\/)?((?:[\w-_]*\.?)+:?\d*(?:\/?[\w-_.]+\/?)*)[\?]?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?"g
You can test it here: https://regex101.com/r/LRWlif/7
The results are a mess:
match,group,is_participating,start,end,content
1,0,yes,0,148,https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
1,1,yes,8,60,content.com/pbpython.py/notebooks/thirsty-allies.mov
1,2,yes,61,65,file
1,3,yes,66,94,The%20Big%20Kahuna.webm.tar.gz
1,4,yes,95,96,f
1,5,yes,97,123,Crosstab%20Explained.ipynb
1,6,yes,124,125,a
1,7,yes,126,127,b
1,8,yes,128,129,m
1,9,yes,130,148,plok%202001.tar.gz
2,0,yes,148,148,
2,1,yes,148,148,
2,2,yes,148,148,
2,3,yes,148,148,
2,4,yes,148,148,
2,5,yes,148,148,
2,6,yes,148,148,
2,7,yes,148,148,
2,8,yes,148,148,
2,9,yes,148,148,
The only good thing is that all the file names are matched without being split, except for "thirsty-allies.mov", which is matched together with some parts of the URL.
There is also the issue that not every escape sequence can be part of a file name. %2F, for example, is the "/" that separates folders in a path and should not be considered part of the match.
For example:
https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
With the same RegEx we get this result:
match,group,is_participating,start,end,content
1,0,yes,0,288,https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
1,1,yes,8,56,www.contoso.com/sites/marketing/documents/Shared
1,2,yes,56,56,
1,3,yes,56,99,%20Documents/Forms/AllItemA.aspx?RootFolder
1,4,yes,99,99,
1,5,yes,100,188,%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
1,6,yes,189,199,FolderCTID
1,7,yes,200,240,0x012000F2A09653197F4F4F919923797C42ADEC
1,8,yes,241,245,View
1,9,yes,246,288,%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
2,0,yes,288,288,
2,1,yes,288,288,
2,2,yes,288,288,
2,3,yes,288,288,
2,4,yes,288,288,
2,5,yes,288,288,
2,6,yes,288,288,
2,7,yes,288,288,
2,8,yes,288,288,
2,9,yes,288,288,
As you can see, the filename to match is:
PFProduct%20Promotion%202001.docx
but the RegEx matched:
%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
How can I get just the filenames and nothing else?
No language is tagged, but if you know the input is always a URL, you might use
(?<=[=\/]|%2F)(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
Explanation
(?<= Positive lookbehind, assert what is to the left of the current position is
[=\/] Match either = or /
| Or
%2F Match literally
) Close the lookbehind
(?: Non capture group
(?!%2F)[^?&\s\/] Match 1 char other than what is listed in the character class if %2F is not directly to the right of the current position
)+ Close the non capture group and repeat 1+ times
\.\w+ Match a dot and 1 or more word characters
(?=[?&]|$) Positive lookahead, assert either ? or & or the end of the string directly to the right of the current position
Regex demo
Other variations
Or use a capture group if your regex engine does not support variable-width lookbehind:
(?:[=\/]|%2F)((?:(?!%2F)[^?&\s\/])+\.\w+)(?=[?&]|$)
Regex demo
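As a rough illustration, the capture-group variant can be tried in Python, whose built-in re module rejects the variable-width lookbehind (?<=[=\/]|%2F). This is only a sketch: the helper name extract_file_names is made up here, and decoding the matches with urllib.parse.unquote is an extra step that is not part of the answer above.

```python
import re
from urllib.parse import unquote

# Capture-group variant from the answer; "/" needs no escaping in Python.
FILE_NAME = re.compile(r"(?:[=/]|%2F)((?:(?!%2F)[^?&\s/])+\.\w+)(?=[?&]|$)")

def extract_file_names(url):
    """Return the file names found in url, percent-decoded (hypothetical helper)."""
    # findall returns only group 1 because the pattern has a single capture group.
    return [unquote(name) for name in FILE_NAME.findall(url)]

url1 = ("https://content.com/pbpython.py/notebooks/thirsty-allies.mov"
        "?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb"
        "&a=b&m=plok%202001.tar.gz")
url2 = ("https://www.contoso.com/sites/marketing/documents/Shared%20Documents"
        "/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments"
        "%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx"
        "&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC"
        "&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D")

print(extract_file_names(url1))
print(extract_file_names(url2))
```

Note that the pattern also returns a file name that ends the path part of the URL (AllItemA.aspx in the second URL), and that it only recognises the uppercase spelling %2F.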
In languages where an infinite quantifier in the lookbehind is supported:
(?<=https?:\/\/\S*(?:[=\/]|%2F))(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
Regex demo
I am looking to get all non dot-files in a folder with a particular extension. So far my regex is:
(?<=\/|^)(?<!\.)(\w+(?:\.mov|\.py|))$
Is there a way to improve the above regex? What might be some examples where this regex might not work?
The \w+ will only match one or more letters, digits or _. It will not match the rest of the chars that may constitute a valid file name. Also, your (?<!\.) lookbehind is redundant because the previous lookbehind already excludes a dot at that position.
Besides, you do not have to repeat the dot pattern in every alternative; you can group just the extensions.
You may use
(?<=\/|^)([^\/]+)(\.(?:mov|py))$
See this regex demo
(?<=\/|^) - / or start of string allowed immediately on the left
([^\/]+) - Group 1: any one or more chars other than /
(\.(?:mov|py)) - Group 2: a . char and then either mov or py
$ - end of string
Note you may also replace (?<=\/|^) with (?<![^\/]) in real code, since it works the same way with standalone strings. It will mess up the demo results at regex101.com because there you test against a single multiline string (that is why I added \n to the negated character class there, too).
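A minimal Python sketch of this pattern follows. It uses the (?<![^/]) form because Python's re module only accepts fixed-width lookbehinds; the helper name split_name and the sample paths are made up for illustration.

```python
import re

# (?<![^/]) stands in for (?<=\/|^): the match must start at the beginning
# of the string or right after a "/".
PATTERN = re.compile(r"(?<![^/])([^/]+)(\.(?:mov|py))$")

def split_name(path):
    """Return (stem, extension) for a .mov/.py file at the end of path, else None."""
    match = PATTERN.search(path)
    return match.groups() if match else None

print(split_name("folder/clip.mov"))       # ('clip', '.mov')
print(split_name("folder/my file-v2.py"))  # ('my file-v2', '.py')
print(split_name("folder/notes.txt"))      # None
```

The second sample contains a space and a hyphen, which \w+ alone would not match.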
Here's how I would do it:
(?<=\/|^)[^\/\\:*?"<>|\n]+\.(?:mov|py)$
(?<=\/|^) Lookbehind just like you had it
[^\/\\:*?"<>|\n]+ One or more of any character that is not disallowed in filenames
\. A literal dot
(?:mov|py) Either "mov" or "py" in a non-capturing group (similar to yours, but I moved the dot out and excluded the redundant "|")
$ Anchors the search to the end of the line, so only files will match, no folders
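The sketch below tries this pattern in Python, again with (?<![^/]) in place of (?<=\/|^) because Python's re module requires fixed-width lookbehinds. It also shows that the pattern as written still accepts names that begin with a dot. The (?!\.) guard in NO_DOT_FILES is a hypothetical addition, not part of the answer, as are the helper name and the sample paths.

```python
import re

# Characters the answer treats as disallowed in file names.
CLASS = r'[^/\\:*?"<>|\n]+'

# The answer's pattern, with a fixed-width lookbehind for Python.
AS_ANSWERED = re.compile(r"(?<![^/])" + CLASS + r"\.(?:mov|py)$")
# Hypothetical addition: (?!\.) refuses names that begin with a dot.
NO_DOT_FILES = re.compile(r"(?<![^/])(?!\.)" + CLASS + r"\.(?:mov|py)$")

def find_name(pattern, path):
    """Return the matched file name, or None when path does not qualify."""
    match = pattern.search(path)
    return match.group(0) if match else None

print(find_name(AS_ANSWERED, "dir/clip.mov"))     # clip.mov
print(find_name(AS_ANSWERED, "dir/clip.mov/"))    # None (a folder)
print(find_name(AS_ANSWERED, "dir/.hidden.py"))   # .hidden.py
print(find_name(NO_DOT_FILES, "dir/.hidden.py"))  # None
```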
I have the following language or locale codes in a URL and I am trying to identify them with a regex. I was partially successful, but it fails for some scenarios.
Languages that I am testing with
en-us -- Passes
us -- Fails
Here is the regex that I have
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/)c\/(deals-and-tips\/)?
For instance:
https://forum.leasehackr.com/en-us/c/deals-and-tips (passes)
https://forum.leasehackr.com/us/c/deals-and-tips (fails)
What am I missing in the above REGEX?
The regex you wanted is:
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips\/)?
The difference from your regex is that I moved the first \/ from inside the parenthesis to outside (to sit with c\/).
Test here.
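To see the effect of moving the slash, here is a small Python sketch that runs both patterns against the two URLs from the question. The helper name locale_of is made up for this example.

```python
import re

# Pattern from the question: the "/" sits inside the second alternative only.
ORIGINAL = re.compile(r"([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}/)c/(deals-and-tips/)?")
# Pattern from this answer: the "/" follows the group, so it applies to both alternatives.
FIXED = re.compile(r"([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})/c/(deals-and-tips/)?")

def locale_of(pattern, url):
    """Return what group 1 captured, or None when the pattern does not match."""
    match = pattern.search(url)
    return match.group(1) if match else None

en_us_url = "https://forum.leasehackr.com/en-us/c/deals-and-tips"
us_url = "https://forum.leasehackr.com/us/c/deals-and-tips"

print(locale_of(ORIGINAL, en_us_url))  # en-us/
print(locale_of(ORIGINAL, us_url))     # None
print(locale_of(FIXED, en_us_url))     # en-us
print(locale_of(FIXED, us_url))        # us
```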
The trailing / in the last group keeps that group from ever matching, since your URLs do not end with a slash. In any case, I would rewrite your regex as: ([a-zA-Z]{2})(-[a-zA-Z]{2})?\/c\/(deals-and-tips)?.
This way it always looks for the first part (en) and treats the second (-us) as optional.
Alternatively, use (\w{2})(-\w{2})?\/c\/(deals-and-tips)? if you don't mind the risk of also matching digits and underscores.
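The difference between the two spellings can be checked with a short Python sketch. The third URL is invented purely to show what \w lets through; the helper name parts is also made up.

```python
import re

STRICT = re.compile(r"([a-zA-Z]{2})(-[a-zA-Z]{2})?/c/(deals-and-tips)?")
LOOSE = re.compile(r"(\w{2})(-\w{2})?/c/(deals-and-tips)?")

def parts(pattern, url):
    """Return the captured groups as a tuple, or None when there is no match."""
    match = pattern.search(url)
    return match.groups() if match else None

en_us = "https://forum.leasehackr.com/en-us/c/deals-and-tips"
us = "https://forum.leasehackr.com/us/c/deals-and-tips"
odd = "https://forum.leasehackr.com/e_-9x/c/deals-and-tips"  # made-up URL

print(parts(STRICT, en_us))  # ('en', '-us', 'deals-and-tips')
print(parts(STRICT, us))     # ('us', None, 'deals-and-tips')
print(parts(LOOSE, odd))     # ('e_', '-9x', 'deals-and-tips')
print(parts(STRICT, odd))    # None
```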
The reason your pattern does not match us is that the alternation ([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/) only matches the \/ in its second alternative.
It also does not capture deals-and-tips in the last group because there is no trailing \/ in the example data.
Your updated pattern might look like
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips)?
Regex demo
You could shorten the pattern a bit by using an optional non capturing group (?:-[a-zA-Z]{2})? inside the first capturing group to optionally match the part starting with a hyphen.
As in the example data, you could also match the leading \/ in front of the capturing group to get a more precise and efficient match.
\/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)\/c\/(deals-and-tips)?
In parts
\/ To be a bit more precise, match the leading /
( Capture group 1
[a-zA-Z]{2} Match 2 chars a-z or A-Z
(?:-[a-zA-Z]{2})? Optionally match - and 2 chars a-z or A-Z
) Close group
\/c\/ Match /c/
(deals-and-tips)? Optional capture group 2, match deals-and-tips
Regex demo
Note that if you use another delimiter than / you don't have to escape the forward slash.
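For instance, in a Python string the pattern can be written with plain slashes. This is a small sketch only: the helper name parse_forum_url and the third URL are made up.

```python
import re

# Final pattern from the answer; "/" needs no escaping in a Python string.
LOCALE = re.compile(r"/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)/c/(deals-and-tips)?")

def parse_forum_url(url):
    """Return (locale, category) or None when the URL has no locale segment."""
    match = LOCALE.search(url)
    return match.groups() if match else None

for url in (
    "https://forum.leasehackr.com/en-us/c/deals-and-tips",
    "https://forum.leasehackr.com/us/c/deals-and-tips",
    "https://forum.leasehackr.com/fr/c/other",  # made-up URL
):
    print(parse_forum_url(url))
```

Group 2 is None when the category is something other than deals-and-tips, as in the last URL.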
I'm trying to replace a link in a html file with regex and nodejs. I want to replace links without a .min.js extension.
For example, it should match "common.js" but not "common.min.js"
Here's what I've tried:
let htmlOutput = html.replace(/common\.(?!min)*js/g, common.name);
I think this negative lookahead should work but it doesn't match anything. Any help would be appreciated.
The (?!min)*js part is wrong: you should not quantify zero-width assertions such as lookaheads (they consume no text, so a quantifier after them is either treated as an error or ignored). Also, since js does not start with min, this lookahead would be redundant even without the quantifier.
If you want to match a string that contains the whole word common, followed by any characters, and that ends with .js but not with .min.js, you need
/\bcommon\b(?!.*\.min\.js$).*\.js$/
See the regex demo.
Details:
\b - word boundary
common - a substring
\b - word boundary
(?!.*\.min\.js$) - immediately to the right, there should not be any 0 or more chars followed with .min.js at the end of the string
.* - any 0 or more chars
\.js - a .js substring
$ - end of string.
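The question uses Node.js, but the same pattern can be checked with a few lines of Python, shown here only as a sketch. The helper name is made up, and because of the $ anchor the pattern is meant for standalone file names or paths, not for a whole HTML document.

```python
import re

NOT_MINIFIED = re.compile(r"\bcommon\b(?!.*\.min\.js$).*\.js$")

def is_plain_common_script(name):
    """True when name contains the word common and ends in .js but not .min.js."""
    return NOT_MINIFIED.search(name) is not None

print(is_plain_common_script("common.js"))                       # True
print(is_plain_common_script("common.min.js"))                   # False
print(is_plain_common_script("common.9cf5748e0e7fc2928a07.js"))  # True
```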
Here, a simpler expression would likely do: after the word common, allow any characters except ., followed by .js:
common([^\.]+)?\.js
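A quick Python check of this expression, given as a sketch with made-up sample names, shows what it accepts and what it rejects. Since no dot is allowed between common and .js, a hashed name such as common.9cf5748e0e7fc2928a07.js is rejected as well.

```python
import re

SIMPLE = re.compile(r"common([^\.]+)?\.js")

def matched_text(name):
    """Return the matched text, or None when the expression does not match."""
    match = SIMPLE.search(name)
    return match.group(0) if match else None

print(matched_text("common.js"))                       # common.js
print(matched_text("common-v2.js"))                    # common-v2.js
print(matched_text("common.min.js"))                   # None
print(matched_text("common.9cf5748e0e7fc2928a07.js"))  # None
```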
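The final regex I'm using is /\bcommon[^min]+js\b/g
This finds the word common followed by one or more characters, none of which is m, i or n, and ending in js. Note that [^min] is a negated character class that excludes those three letters individually rather than the word min; it works here because my file names contain none of them between common and js. It lets me replace scripts on my HTML page like: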
script src="~/dist/common.js"
OR
script src="~/dist/common.9cf5748e0e7fc2928a07.js"
Thanks to Wiktor Stribiżew for helping me.
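For comparison, here is a Python sketch of how the character class behaves on a name that merely contains one of the letters m, i or n. The LOOKAHEAD pattern is a hypothetical alternative, not taken from any answer above: it applies the negative-lookahead idea to text inside an HTML attribute by stopping at quotes and whitespace. The names rewrite and new_name, and the sample file names, are made up.

```python
import re

# The regex from the post: [^min] excludes the letters m, i and n one by one.
CHAR_CLASS = re.compile(r"\bcommon[^min]+js\b")
# Hypothetical alternative: reject only names ending in .min.js; the name is
# assumed to end at the next quote or whitespace character.
LOOKAHEAD = re.compile(r"""\bcommon\b(?![^"'\s]*\.min\.js)[^"'\s]*\.js\b""")

def rewrite(html, new_name):
    """Replace every non-minified common script name in html with new_name."""
    # A function replacement avoids backslash processing in new_name.
    return LOOKAHEAD.sub(lambda match: new_name, html)

print(CHAR_CLASS.search("common.min.js"))            # None
print(CHAR_CLASS.search("common.main.js"))           # None, although not minified
print(LOOKAHEAD.search("common.main.js").group(0))   # common.main.js
print(rewrite('<script src="~/dist/common.9cf5748e0e7fc2928a07.js"></script>',
              "common.new.js"))
```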