Regular expression for variable routes - regex

I have the following directory:
Videos/common/Project/Project01/video.project_01.StatusOK/video.project_01.StatusOK.csproj
And the regular expression that I use to extract only the last part of the route (video.project_01.StatusOK.csproj) is the following:
([\w|.])/Project/([\w|.|\s])/([\w|.|\s])/([\w|.|\s])([.]*)
The problem is that if the route varies, for example if there is an extra directory before video.project_01.StatusOK.csproj, like this: Videos/common/Project/Project01/video.project_01.StatusOK/test/video.project_01.StatusOK.csproj, it extracts 'test' instead.
Can someone help me with a regular expression for Java that always extracts the last part, the one containing '.csproj', whatever the route?
Regards, and thank you very much

Try this Regex:
(?<=\/)[^\/]+csproj
Explanation:
(?<=\/) - positive lookbehind to find the position immediately preceded by a /
[^\/]+ - matches 1+ occurrences of any character that is not a /
csproj - matches csproj literally
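A minimal sketch of how this pattern could be used from Java. The class and method names (CsprojName, extract) are hypothetical and chosen only for illustration. In a Java string the forward slash needs no escaping, so \/ becomes /.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsprojName {
    // Same pattern as above: a segment preceded by "/" that ends in "csproj".
    private static final Pattern LAST_PART = Pattern.compile("(?<=/)[^/]+csproj");

    // Returns the first matching path segment, or null if there is none.
    static String extract(String route) {
        Matcher m = LAST_PART.matcher(route);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String shortRoute = "Videos/common/Project/Project01/video.project_01.StatusOK/video.project_01.StatusOK.csproj";
        String longRoute = "Videos/common/Project/Project01/video.project_01.StatusOK/test/video.project_01.StatusOK.csproj";
        // Both lines print video.project_01.StatusOK.csproj
        System.out.println(extract(shortRoute));
        System.out.println(extract(longRoute));
    }
}
```

Because find() returns the first match, a directory whose name happens to end in csproj would be returned instead of the file; appending $ to the pattern is one way to guard against that.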

In case you are unaware, Java 7 introduced NIO2 which brought a new interface java.nio.file.Path. You can break up the path to your directory and then use a regular expression on each part of the path.
Oracle's Java Tutorial has a section on Path Operations
(There is also a section on Regular Expressions)
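A rough sketch of that idea, assuming the route is available as a plain string. The class and method names (LastPathPart, fileName, partsMatching) are made up for this example; Paths.get, Path.getFileName and iterating over a Path are standard java.nio.file features.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LastPathPart {
    // Returns the last element of the route, or null if the path has no elements.
    static String fileName(String route) {
        Path name = Paths.get(route).getFileName();
        return name == null ? null : name.toString();
    }

    // Breaks the route into its parts and keeps those that fully match the regex.
    static List<String> partsMatching(String route, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (Path part : Paths.get(route)) {
            String s = part.toString();
            if (p.matcher(s).matches()) {
                hits.add(s);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String route = "Videos/common/Project/Project01/video.project_01.StatusOK/test/video.project_01.StatusOK.csproj";
        System.out.println(fileName(route));
        System.out.println(partsMatching(route, ".*\\.csproj"));
    }
}
```

With this approach no regular expression is needed to find the last part; the regex only decides whether a given part is of interest.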

If you want to keep to the /Project/ in your path, you could try this:
.*?/Project/.*?(?<=\/)([\w. ]+\.csproj)$
That would match:
.*? - any character zero or more times, non-greedy
/Project/ - the literal text /Project/
.*? - any character zero or more times, non-greedy
(?<=\/) - positive lookbehind asserting that what comes before is a forward slash
( - open the capturing group that will contain your match
[\w. ]+ - a character class matching one or more word characters, dots, or spaces
\.csproj - the literal text .csproj
) - close the capturing group
$ - the end of the string
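The breakdown above could be tried out in Java roughly as follows. This is only a sketch: the names ProjectCsproj and extract are hypothetical, and the character class is written as [\w. ]+ as in the breakdown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProjectCsproj {
    // Requires /Project/ somewhere in the route and captures the final .csproj segment.
    private static final Pattern P =
            Pattern.compile(".*?/Project/.*?(?<=/)([\\w. ]+\\.csproj)$");

    // Returns capture group 1 (the file name), or null if the route does not match.
    static String extract(String route) {
        Matcher m = P.matcher(route);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints video.project_01.StatusOK.csproj
        System.out.println(extract(
                "Videos/common/Project/Project01/video.project_01.StatusOK/test/video.project_01.StatusOK.csproj"));
        // Prints null because the route has no /Project/ directory
        System.out.println(extract("Videos/common/Other/video.project_01.StatusOK.csproj"));
    }
}
```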

Related

How to extract only filenames from an URL with RegEx even with ANSI escaped characters?

I need to extract ONLY the file names from any URL. I looked at all previous answers on stackoverflow regarding URLs and filenames, but no one considered the case of a file name with escaped characters.
For example, I have a URL like this:
https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
I tried many regular expressions, and finally I found one that did not split the file names when it encountered an escaped character:
"(?:\w*:\/\/)?((?:[\w-_]*\.?)+:?\d*(?:\/?[\w-_.]+\/?)*)[\?]?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?"g
You can test it here: https://regex101.com/r/LRWlif/7
The results are a mess:
match,group,is_participating,start,end,content
1,0,yes,0,148,https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
1,1,yes,8,60,content.com/pbpython.py/notebooks/thirsty-allies.mov
1,2,yes,61,65,file
1,3,yes,66,94,The%20Big%20Kahuna.webm.tar.gz
1,4,yes,95,96,f
1,5,yes,97,123,Crosstab%20Explained.ipynb
1,6,yes,124,125,a
1,7,yes,126,127,b
1,8,yes,128,129,m
1,9,yes,130,148,plok%202001.tar.gz
2,0,yes,148,148,
2,1,yes,148,148,
2,2,yes,148,148,
2,3,yes,148,148,
2,4,yes,148,148,
2,5,yes,148,148,
2,6,yes,148,148,
2,7,yes,148,148,
2,8,yes,148,148,
2,9,yes,148,148,
The only good thing is that the file names are all matched somehow, without being split, with the exception of "thirsty-allies.mov", which is matched along with some URL parts.
Another issue is that not all escaped characters can be part of a file name. %2F, for example, is the "/" that separates folders in paths, and should not be considered part of the match.
For example:
https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
With the same RegEx we get this result:
match,group,is_participating,start,end,content
1,0,yes,0,288,https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
1,1,yes,8,56,www.contoso.com/sites/marketing/documents/Shared
1,2,yes,56,56,
1,3,yes,56,99,%20Documents/Forms/AllItemA.aspx?RootFolder
1,4,yes,99,99,
1,5,yes,100,188,%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
1,6,yes,189,199,FolderCTID
1,7,yes,200,240,0x012000F2A09653197F4F4F919923797C42ADEC
1,8,yes,241,245,View
1,9,yes,246,288,%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
2,0,yes,288,288,
2,1,yes,288,288,
2,2,yes,288,288,
2,3,yes,288,288,
2,4,yes,288,288,
2,5,yes,288,288,
2,6,yes,288,288,
2,7,yes,288,288,
2,8,yes,288,288,
2,9,yes,288,288,
As you can see, the filename to match is:
PFProduct%20Promotion%202001.docx
but the RegEx matched:
%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
How can I get just the filenames and nothing else?
There is no language tagged, but if you know that you always have urls you might use
(?<=[=\/]|%2F)(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
Explanation
(?<= Positive lookbehind, assert what is to the left of the current position is
[=\/] Match either = or /
| Or
%2F Match literally
) Close the lookbehind
(?: Non capture group
(?!%2F)[^?&\s\/] Match 1 char other than what is listed in the character class if %2F is not directly to the right of the current position
)+ Close the non capture group and repeat 1+ times
\.\w+ Match a dot and 1 or more word characters
(?=[?&]|$) Positive lookahead, assert either ? or & or the end of the string directly to the right of the current position
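Since no language was tagged, here is a hedged sketch in Java, whose regex engine accepts this bounded-width lookbehind. The class and method names (UrlFileNames, extract) are invented for the example, and the pattern only recognizes an uppercase %2F.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlFileNames {
    // The pattern from above, with slashes left unescaped as Java allows.
    private static final Pattern FILE_NAME =
            Pattern.compile("(?<=[=/]|%2F)(?:(?!%2F)[^?&\\s/])+\\.\\w+(?=[?&]|$)");

    // Collects every file-name-like match, still percent-encoded.
    static List<String> extract(String url) {
        List<String> names = new ArrayList<>();
        Matcher m = FILE_NAME.matcher(url);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        String url = "https://content.com/pbpython.py/notebooks/thirsty-allies.mov"
                + "?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz";
        // Prints [thirsty-allies.mov, The%20Big%20Kahuna.webm.tar.gz, Crosstab%20Explained.ipynb, plok%202001.tar.gz]
        System.out.println(extract(url));
    }
}
```

Note that pbpython.py is not reported because it is followed by another / and so acts as a directory. On the second example URL the same pattern returns AllItemA.aspx in addition to PFProduct%20Promotion%202001.docx.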
Other variations
Or use a capture group if your regex flavor does not support a lookbehind whose alternatives differ in width:
(?:[=\/]|%2F)((?:(?!%2F)[^?&\s\/])+\.\w+)(?=[?&]|$)
In languages where an infinite quantifier in the lookbehind is supported:
(?<=https?:\/\/\S*(?:[=\/]|%2F))(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)

Improve regex for capturing files in a directory, excluding dotfiles

I am looking to get all non dot-files in a folder with a particular extension. So far my regex is:
(?<=\/|^)(?<!\.)(\w+(?:\.mov|\.py|))$
Is there a way to improve the above regex? What might be some examples where this regex might not work?
The \w+ will only match one or more letters, digits or _. It will not match the rest of the chars that may constitute a valid file name. Also, your (?<!\.) lookbehind is redundant because the previous lookbehind already excludes a dot at that position.
Besides, you do not have to repeat the dot pattern; you may use grouping for the extensions only.
You may use
(?<=\/|^)([^\/]+)(\.(?:mov|py))$
See this regex demo
(?<=\/|^) - / or start of string allowed immediately on the left
([^\/]+) - Group 1: any one or more chars other than /
(\.(?:mov|py)) - Group 2: a . char and then either mov or py
$ - end of string
Note you may also replace (?<=\/|^) with (?<![^\/]) in real code since it will work the same with standalone strings. It will mess the demo results at regex101.com because there, you test against a single multiline string (that is why I added \n to the negated character class there, too).
Here's how I would do it:
(?<=\/|^)[^\/\\:*?"<>|\n]+\.(?:mov|py)$
(?<=\/|^) Lookbehind just like you had it
[^\/\\:*?"<>|\n]+ One or more of any character that is not disallowed in filenames
\. A literal dot
(?:mov|py) Either "mov" or "py" in a non-capturing group (similar to yours, but I moved the dot out and excluded the redundant "|")
$ Anchors the search to the end of the line, so only files will match, no folders
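One thing worth checking: because [^\/]+ may begin with a dot, a name such as .hidden.py would still get through. The sketch below is an assumption about what "excluding dotfiles" should mean, not part of either answer: it adds a negative lookahead (?!\.) and uses the (?<![^\/]) form of the lookbehind mentioned above. The names MediaFiles and fileName are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MediaFiles {
    // (?<![^/])  - start of string or right after a "/"
    // (?!\.)     - the file name must not start with a dot (added here as an assumption)
    // [^/]+      - the rest of the name, without path separators
    // \.(?:mov|py)$ - one of the wanted extensions at the end of the string
    private static final Pattern P =
            Pattern.compile("(?<![^/])(?!\\.)[^/]+\\.(?:mov|py)$");

    // Returns the matched file name, or null when the path is rejected.
    static String fileName(String path) {
        Matcher m = P.matcher(path);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(fileName("videos/clip.mov")); // clip.mov
        System.out.println(fileName(".hidden.py"));      // null
        System.out.println(fileName("src/.env.py"));     // null
        System.out.println(fileName("notes.txt"));       // null
    }
}
```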

Regex to match ISO languages ISO

I have the following language or language-locale codes in a URL and I am trying to identify them with a regex. I was partially successful, but it fails in some scenarios.
Languages that I am testing with:
en-us -- Passes
us -- Fails
Here is the regex that I have:
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/)c\/(deals-and-tips\/)?
For instance:
https://forum.leasehackr.com/en-us/c/deals-and-tips (passes)
https://forum.leasehackr.com/us/c/deals-and-tips (fails)
What am I missing in the above REGEX?
The regex you wanted is:
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips\/)?
The difference from your regex is that I moved the first \/ from inside the parenthesis to outside (to sit with c\/).
The last / prevents the final group from matching in any case, since your URLs don't end with it. Either way, I would rewrite your regex like this: ([a-zA-Z]{2})(-[a-zA-Z]{2})?\/c\/(deals-and-tips)?.
This way it always looks for the first part (en) and treats the second (-us) as optional.
Alternatively, use (\w{2})(-\w{2})?\/c\/(deals-and-tips)? if you don't mind also matching digits and underscores.
The reason your pattern does not match us is because the alternation ([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/) only matches the \/ in the second part of the alternation.
Also it does not match the last group with deals-and-tips because there is no trailing \/ in the example data.
Your updated pattern might look like
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips)?
You could shorten the pattern a bit by using an optional non capturing group (?:-[a-zA-Z]{2})? inside the first capturing group to optionally match the part starting with a hyphen.
As in the example data you could match the leading \/ in front of the capturing group to get a more efficient match.
\/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)\/c\/(deals-and-tips)?
In parts
\/ To be a bit more precise, match the leading /
( Capture group 1
[a-zA-Z]{2} Match 2 chars a-z
(?:-[a-zA-Z]{2})? Optionally match - and 2 chars a-z
) Close group
\/c\/ Match /c/
(deals-and-tips)? Optional capture group 2 match deals-and-tips
Note that if you use another delimiter than / you don't have to escape the forward slash.
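No language was given in the question, so as an illustration only, here is how the last pattern might be exercised in Java, where the forward slash needs no escaping. LocalePrefix and locale are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalePrefix {
    // Group 1: "xx" or "xx-yy"; group 2 (optional): deals-and-tips
    private static final Pattern P =
            Pattern.compile("/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)/c/(deals-and-tips)?");

    // Returns the language or language-locale code, or null if the URL has none.
    static String locale(String url) {
        Matcher m = P.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(locale("https://forum.leasehackr.com/en-us/c/deals-and-tips")); // en-us
        System.out.println(locale("https://forum.leasehackr.com/us/c/deals-and-tips"));    // us
    }
}
```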

Regex lookahead. Find word without .min. in string

I'm trying to replace a link in an HTML file with regex and Node.js. I want to replace links without a .min.js extension.
For example, it should match "common.js" but not "common.min.js"
Here's what I've tried:
let htmlOutput = html.replace(/common\.(?!min)*js/g, common.name);
I think this negative lookahead should work but it doesn't match anything. Any help would be appreciated.
The (?!min)*js part is corrupt: you should not quantify zero-width assertions like lookaheads (they do not consume text so quantifiers after them are treated either as user errors or are ignored). Since js does not start with min this lookahead even without a quantifier is redundant.
If you want to match a string that contains the whole word common, followed by any chars, and that ends with .js but not .min.js, you need
/\bcommon\b(?!.*\.min\.js$).*\.js$/
Details:
\b - word boundary
common - a substring
\b - word boundary
(?!.*\.min\.js$) - immediately to the right, there should not be any 0 or more chars followed with .min.js at the end of the string
.* - any 0 or more chars
\.js - a .js substring
$ - end of string.
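The question uses Node.js; the sketch below checks the same pattern text in Java purely as an illustration of what it accepts and rejects. NonMinScript and isTarget are hypothetical names, and the input is assumed to be a single file name or path rather than a whole HTML document, because the pattern is anchored with $.

```java
import java.util.regex.Pattern;

class NonMinScript {
    // Whole word "common", not followed by anything ending in .min.js, ending in .js
    private static final Pattern P =
            Pattern.compile("\\bcommon\\b(?!.*\\.min\\.js$).*\\.js$");

    // True when the name refers to a common*.js file that is not minified.
    static boolean isTarget(String name) {
        return P.matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(isTarget("common.js"));                      // true
        System.out.println(isTarget("common.min.js"));                  // false
        System.out.println(isTarget("common.9cf5748e0e7fc2928a07.js")); // true
        System.out.println(isTarget("uncommon.js"));                    // false
    }
}
```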
Here, a simpler expression may be enough: match the word common, optionally followed by any characters other than ., and then .js:
common([^\.]+)?\.js
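The final regex I'm using is /\bcommon[^min]+js\b/g
This finds the word common followed by one or more characters and ending in js. Note that [^min] is a negated character class, so it excludes the individual characters m, i, and n rather than the word min. That is enough to skip common.min.js while still letting me replace scripts on my HTML page like: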
script src="~/dist/common.js"
OR
script src="~/dist/common.9cf5748e0e7fc2928a07.js"
Thanks to Wiktor Stribiżew for helping me.

Use regular expressions in Visual Studio to match (non-consecutive) and replace recurring string in an expression

I am tasked to refactor namespaces in vs2015 Solution, removing duplicate/repeating words.
I need a FIND regex that returns these namespaces and everywhere that may have been used or referenced.
I need replace regex to remove the second occurrence of the word from namespace.
EXAMPLE
TestApp.SA.TestApp => TestApp.SA
TestApp.TestApp.SA => TestApp.SA
Here is my regex to Find(which I know can be better) : TestApp.*?(TestApp)
Somebody please help with an expression for replace, which I think is to set the second occurrence of TestApp to whiteSpace ?
The patterns I suggest are not a 100% safe solution, but they show a way to use regex for search and for search and replace in your files.
The basic expressions you may use for the task are
(\w+)\.(\w+\.)*\1
and
Find: (\w+)((?:\.\w+)*)\.\1
Replace: $1$2
The patterns mean:
(\w+) - match and capture 1+ alphanumeric/underscore chars into Group 1
\. - matches a literal dot
(\w+\.)* - zero or more sequences ((...)*) of 1+ word chars followed with a dot (each subsequent submatch will erase the Group 2 buffer, but it is not important when just searching)
\1 - a backreference to the contents captured in Group 1
The second pattern is almost the same, just the capturing groups are a bit adjusted for the replacement numbered backreferences to replace text correctly.
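To see what the Find/Replace pair does to the two example namespaces, here is a small Java sketch. Visual Studio uses .NET regular expressions, so this is only an approximation for experimenting outside the IDE; NamespaceDedup and dedupe are hypothetical names.

```java
class NamespaceDedup {
    // Find:    (\w+)((?:\.\w+)*)\.\1
    // Replace: $1$2  (keeps the first word and the middle part, drops the repeat)
    static String dedupe(String namespace) {
        return namespace.replaceAll("(\\w+)((?:\\.\\w+)*)\\.\\1", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(dedupe("TestApp.SA.TestApp")); // TestApp.SA
        System.out.println(dedupe("TestApp.TestApp.SA")); // TestApp.SA
        System.out.println(dedupe("TestApp.SA.Core"));    // TestApp.SA.Core (unchanged)
    }
}
```

One case where the pattern is not safe: without word boundaries, \1 can match the start of a longer word, so TestApp.TestApplication would also be rewritten. Wrapping the backreference as \1\b is one possible guard.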