Improve regex for capturing files in a directory, excluding dotfiles - regex

I am looking to get all non dot-files in a folder with a particular extension. So far my regex is:
(?<=\/|^)(?<!\.)(\w+(?:\.mov|\.py|))$
Is there a way to improve the above regex? What might be some examples where this regex might not work?

The \w+ will only match one or more letters, digits or _. It will not match the rest of the chars that may constitute a valid file name. Also, your (?<!\.) lookbehind is redundant because the previous lookbehind already excludes a dot at that position.
Besides, you do not have to repeat the dot pattern in every alternative; you can group just the extension names.
You may use
(?<=\/|^)([^\/]+)(\.(?:mov|py))$
See this regex demo
(?<=\/|^) - / or start of string allowed immediately on the left
([^\/]+) - Group 1: any one or more chars other than /
(\.(?:mov|py)) - Group 2: a . char and then either mov or py
$ - end of string
Note that in real code you may also replace (?<=\/|^) with (?<![^\/]), since it works the same on standalone strings. It will skew the demo results at regex101.com, because there you test against a single multiline string (that is why I added \n to the negated character class there, too).
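A minimal Python sketch of the pattern above, for illustration only. Python's re module rejects (?<=\/|^) because its lookbehinds must be fixed-width, so the sketch uses the (?<![^/]) form from the note. The names FILE_RE, NO_DOTFILE_RE and split_name are hypothetical. The (?!\.) lookahead in the second pattern is an added assumption, not part of the answer: [^/]+ on its own also accepts names that start with a dot.

```python
import re

# Python's re needs fixed-width lookbehinds, so (?<![^/]) stands in for (?<=\/|^).
FILE_RE = re.compile(r"(?<![^/])([^/]+)(\.(?:mov|py))$")

# Assumption (not in the answer above): (?!\.) rejects names that start with a dot.
NO_DOTFILE_RE = re.compile(r"(?<![^/])(?!\.)([^/]+)(\.(?:mov|py))$")


def split_name(path, pattern=FILE_RE):
    """Return (stem, extension) for a matching path, else None."""
    m = pattern.search(path)
    return m.groups() if m else None


print(split_name("videos/clip.mov"))                # ('clip', '.mov')
print(split_name("notes.txt"))                      # None
print(split_name("src/.hidden.py"))                 # ('.hidden', '.py')
print(split_name("src/.hidden.py", NO_DOTFILE_RE))  # None
```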

Here's how I would do it:
(?<=\/|^)[^\/\\:*?"<>|\n]+\.(?:mov|py)$
(?<=\/|^) Lookbehind just like you had it
[^\/\\:*?"<>|\n]+ One or more of any character that is not disallowed in filenames
\. A literal dot
(?:mov|py) Either "mov" or "py" in a non-capturing group (similar to yours, but I moved the dot out and excluded the redundant "|")
$ Anchors the search to the end of the line, so only files will match, no folders
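As a rough illustration of the end-of-line anchoring, here is a Python sketch that runs this pattern over a multi-line listing. It assumes re.MULTILINE so that $ matches at each line end, and it swaps the lookbehind for (?<![^/\n]) because Python's re only accepts fixed-width lookbehinds. The names LISTING_RE and listing, and the sample paths, are made up for the example.

```python
import re

# (?<![^/\n]) means: preceded by "/", by a newline, or by nothing at all.
LISTING_RE = re.compile(r'(?<![^/\n])[^/\\:*?"<>|\n]+\.(?:mov|py)$', re.MULTILINE)

# Hypothetical listing: one path per line.
listing = "src/main.py\nmedia/clip.mov\ndocs/readme.txt\nbuild/\ntop.py"

matches = LISTING_RE.findall(listing)
print(matches)  # ['main.py', 'clip.mov', 'top.py']
```

The folder line (build/) and the .txt file are skipped because nothing after the last slash ends in .mov or .py.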

Related

Regex to check for certain extensions or no extension and is only 0-9, a-z and hyphens

I'm looking for a regex expression to only match certain filenames and extensions.
The filename may or may not have an extension e.g. test and test.txt are valid, but if it does have an extension then it must be limited to certain ones e.g. only .txt or .md but only those 2. It also needs to just be limited to a-z and 0-9 and hyphens/dashes, but should not end with a dash.
Not sure it helps but I've listed some valid and invalid ones below. I'm using an existing regex that works fine without extensions - ^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*)$ but as soon as I bring extensions into it I can't seem to find a solution. I looked at several answers and Regex to check if file does not have an extension is close, but it allows characters that aren't a-z or 0-9 or hyphens and I couldn't work out how to correct it.
Valid/matching
test
test.txt
test.md
test-one
test-one.md
Invalid/non-matching
test.jpg
test_one
test_one.jpeg
test-
How to match extensions?
As I mentioned in my original comment, your regex is well-formed; you just need to add an optional group for the extensions: (\.(md|txt))?
I also switched the order of the first two groups to make it more efficient (prevents unneeded backtracking if no - is found)
The regex below adds this logic to your pattern. I also removed the capture group surrounding the entire pattern as it's not necessary. If you want, you can always use the second pattern to get each part into a different group. If you don't need any groups, use the third pattern below (assuming your regex engine supports non-capture groups), and if you require two groups: one for the extension and one for the filename, use the fourth pattern below (with the same assumption):
# 1 - minimally changed original pattern
^([a-zA-Z0-9]+-)*[a-zA-Z0-9]+(\.(md|txt))?$
# 2 - filename parts into groups
^(([a-zA-Z0-9]+-)*[a-zA-Z0-9]+)(\.(md|txt))?$
# 3 - no captures
^(?:[a-zA-Z0-9]+-)*[a-zA-Z0-9]+(?:\.(?:md|txt))?$
# 4 - filename and extension in groups
^((?:[a-zA-Z0-9]+-)*[a-zA-Z0-9]+)(\.(?:md|txt))?$ # captures .ext in 2nd group
^((?:[a-zA-Z0-9]+-)*[a-zA-Z0-9]+)(?:\.(md|txt))?$ # captures ext in 2nd group
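To check the patterns against the names listed in the question, a short Python sketch along these lines could be used. It takes pattern 3 (no captures); NAME_RE, valid and invalid are just names chosen for the example.

```python
import re

# Pattern 3 from above (no capture groups).
NAME_RE = re.compile(r"^(?:[a-zA-Z0-9]+-)*[a-zA-Z0-9]+(?:\.(?:md|txt))?$")

valid = ["test", "test.txt", "test.md", "test-one", "test-one.md"]
invalid = ["test.jpg", "test_one", "test_one.jpeg", "test-"]

for name in valid + invalid:
    # Prints True for every name in `valid` and False for every name in `invalid`.
    print(f"{name!r}: {bool(NAME_RE.match(name))}")
```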
How to make it shorter?
Additionally, in some regex engines you can replace [a-zA-Z0-9] with one of the following character sets:
# any letter in range a-z or any digit
# use case-insensitive flag to also match A-Z
[a-z\d]
# any character that is neither a non-word character nor _
# in other words, any word character ([a-zA-Z0-9_]) except _, i.e. `[a-zA-Z0-9]`
[^\W_]
Shortest pattern:
^([^\W_]-?)*[^\W_](\.(md|txt))?$
How to make it more efficient?
Most efficient pattern (you can use any of the character class substitutions without changing the number of steps that this pattern takes to complete - I defaulted it to the shortest version of [^\W_]):
^([^\W_]+-)*[^\W_]+(\.(md|txt))?$
# if your regex engine accepts possessive quantifiers, use this to prevent backtracking
^([^\W_]+-)*+[^\W_]++(\.(md|txt))?$
# (the possessive quantifiers are the *+ and the ++)
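One caveat when porting the [^\W_] shorthand: in Python 3, \w is Unicode-aware for str patterns by default, so [^\W_] also accepts non-ASCII letters and digits. The sketch below assumes Python and uses the re.ASCII flag to get back to [a-zA-Z0-9]; ASCII_RE, UNICODE_RE and the sample name are hypothetical.

```python
import re

PATTERN = r"^(?:[^\W_]+-)*[^\W_]+(?:\.(?:md|txt))?$"

# With re.ASCII, \w is [a-zA-Z0-9_], so [^\W_] is exactly [a-zA-Z0-9].
ASCII_RE = re.compile(PATTERN, re.ASCII)
# Without the flag, \w also covers non-ASCII letters and digits.
UNICODE_RE = re.compile(PATTERN)

name = "t\u00e9st.md"  # "tést.md", written with an escape to avoid encoding issues

print(bool(ASCII_RE.match(name)))       # False
print(bool(UNICODE_RE.match(name)))     # True
print(bool(ASCII_RE.match("test.md")))  # True
```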
This may work, although it also accepts underscores and a trailing hyphen, which the question rules out:
(^([a-zA-Z0-9\-_]+)$|^([a-zA-Z0-9\-_]+\.(txt|md))$)

How to end a string with $ directly after .* with a RegEx?

I'm trying to report on a set of URLs that catches all potential URL parameters and I'm having an issue defining the RegEx properly.
We have this RegEx to capture a few variations of our URLs to feed into our reporting but I need to be able to end the string with a $ but when I do, it doesn't show any results.
The RegEx:
/join/$|/join/\?product.*|/join/\.*
For another account, we only use one variation which is outlined below (which works):
^/join/$
I believe the issue is in that after \?product.*, I'm not ending the string (or even starting it).
So far I have tried: ^/join/$|(^[/join/\?product.*]$)|(^[/join/\.*]$) with no luck.
If you want to match the dollar sign literally you have to escape it \$ or else it would mean an anchor to assert the end of the string / line.
This pattern ^/join/$ would therefore only match /join/
In your pattern you use an alternation where the last part /join/\.* would match /join/ but also /join/..... because when you escape the dot you will match it literally and the * quantifier repeats 0+ times.
Perhaps you are looking for:
^/join/(?:\?product.*\$)?$
This will match /join/ followed by an optional part (?:\?product.*\$)? that will match ?product, followed by any char 0+ times and will end on $.
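The difference between the anchor $ and the literal \$ can be tried out quickly in Python. This is only a sketch: the reporting tool may use a different regex flavour, and the URL list, including the product=abc parameter, is invented for the example.

```python
import re

# Last alternative of the original pattern: the escaped dot is a literal dot.
DOTS_RE = re.compile(r"/join/\.*")
# Pattern suggested above: \$ is a literal dollar sign, the final $ is the anchor.
JOIN_RE = re.compile(r"^/join/(?:\?product.*\$)?$")

# Hypothetical URLs.
urls = ["/join/", "/join/?product=abc$", "/join/?product=abc", "/join/...."]

matched = [u for u in urls if JOIN_RE.search(u)]
print(matched)  # ['/join/', '/join/?product=abc$']

# The escaped-dot alternative happily matches a run of literal dots.
print(bool(DOTS_RE.fullmatch("/join/....")))  # True
```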
Make the pattern lazy, and escape $, since it is a special character in regex. (As for escaping, Google Analytics may follow different rules.) Also be careful with []: it defines a character class that matches a single character from a set or range, whereas I think you are trying to create a group, which needs ().
\?product.*?\$

Regular expression for variable routes

I have the following directory:
Videos/common/Project/Project01/video.project_01.StatusOK/video.project_01.StatusOK.csproj
And the regular expression that I use to extract only the last part of the route (video.project_01.StatusOK.csproj) is the following:
([\w|.])/Project/([\w|.|\s])/([\w|.|\s])/([\w|.|\s])([.]*)
The problem is that if the route varies, that is if there is a directory before: video.project_01.StatusOK.csproj, for example like this: Videos/common/Project/Project01/video.project_01.StatusOK/test/video.project_01. StatusOK.csproj, I would extract 'test'.
I'm looking for a regular expression for Java that always extracts the last part, the one containing '.csproj', whatever the route.
Regards, and thank you very much
Try this Regex:
(?<=\/)[^\/]+csproj
Explanation:
(?<=\/) - positive lookbehind to find the position immediately preceded by a /
[^\/]+ - matches 1+ occurrences of any character that is not a /
csproj - matches csproj literally
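The question asks about Java, but the pattern is easy to try out in Python first, since the single-character lookbehind works the same way there. This is a small sketch with a hypothetical helper, last_part, run on the two routes from the question.

```python
import re

LAST_PART_RE = re.compile(r"(?<=/)[^/]+csproj")


def last_part(route):
    """Return the first path segment ending in 'csproj', or None."""
    m = LAST_PART_RE.search(route)
    return m.group(0) if m else None


short = "Videos/common/Project/Project01/video.project_01.StatusOK/video.project_01.StatusOK.csproj"
longer = "Videos/common/Project/Project01/video.project_01.StatusOK/test/video.project_01. StatusOK.csproj"

print(last_part(short))   # video.project_01.StatusOK.csproj
print(last_part(longer))  # video.project_01. StatusOK.csproj
```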
In case you are unaware, Java 7 introduced NIO2 which brought a new interface java.nio.file.Path. You can break up the path to your directory and then use a regular expression on each part of the path.
Oracle's Java Tutorial has a section on Path Operations
(There is also a section on Regular Expressions)
If you want to keep to the /Project/ in your path, you could try this:
.*?/Project/.*?(?<=\/)([\w. ]+\.csproj)$
That would match
match any character zero or more times non greedy (.*?)
match /Project/
match any character zero or more times non greedy (.*?)
positive lookbehind that asserts that what is before is a forward slash (?<=\/)
A capturing group ( this will contain your match
A character class that matches a word character, a dot or a space, repeated one or more times [\w. ]+
Match .csproj \.csproj
Close the capturing group )
The end of the string $
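The steps above can be sketched in Python as follows, using the character class [\w. ] as it is described in the list. PROJECT_RE and file_under_project are hypothetical names, and the file name ends up in capturing group 1.

```python
import re

PROJECT_RE = re.compile(r".*?/Project/.*?(?<=/)([\w. ]+\.csproj)$")


def file_under_project(route):
    """Return the .csproj file name if the route contains /Project/, else None."""
    m = PROJECT_RE.match(route)
    return m.group(1) if m else None


route = "Videos/common/Project/Project01/video.project_01.StatusOK/test/video.project_01. StatusOK.csproj"

print(file_under_project(route))                    # video.project_01. StatusOK.csproj
print(file_under_project("Videos/other/a.csproj"))  # None
```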

RegEx expression not allowing only spaces?

I have this regEx expression which allows only spaces, letters and dashes. I'd like to modify it so it wouldn't allow ONLY spaces too. Can someone help me ?
/^([A-zăâîșțĂÂÎȘȚ-\s])+$/
You can use a negative lookahead to restrict this generic pattern:
/^(?!\s+$)[A-Za-zăâîșțĂÂÎȘȚ\s-]+$/
^^^^^^^^
The (?!\s+$) lookahead is executed once at the very beginning and fails the match if the string consists of nothing but 1 or more whitespace characters.
Also, your regex contained a classical issue of [A-z] that matches more than just ASCII letters, you need to replace this with [A-Za-z] (or just [a-z] and use the /i case insensitive modifier).
Also, the - inside a character class is usually placed at the end so as not to escape it, and it will be parsed as a literal hyphen (however, you might want to escape it if another developer will have to update this pattern by adding more symbols to the character class).
And just in case this is a regex engine that does not support lookarounds:
^[A-Za-zăâîșțĂÂÎȘȚ\s-]*[A-Za-zăâîșțĂÂÎȘȚ-][A-Za-zăâîșțĂÂÎȘȚ\s-]*$
It requires at least 1 non-space character from the allowed set (also matching 1 obligatory symbol).
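Both variants can be compared side by side. The sketch below assumes Python; the question looks like JavaScript, but both patterns use only syntax the two flavours share. LETTERS, WITH_LOOKAHEAD and NO_LOOKAROUND are hypothetical names, and the last line shows the [A-z] pitfall mentioned above.

```python
import re

LETTERS = "A-Za-zăâîșțĂÂÎȘȚ"

# Lookahead version: reject strings that are whitespace only.
WITH_LOOKAHEAD = re.compile(rf"^(?!\s+$)[{LETTERS}\s-]+$")
# Lookaround-free version: require one non-space character from the set.
NO_LOOKAROUND = re.compile(rf"^[{LETTERS}\s-]*[{LETTERS}-][{LETTERS}\s-]*$")

samples = ["Ion Popescu", "Ana-Maria", "   ", "Ion_1", ""]
for s in samples:
    print(repr(s), bool(WITH_LOOKAHEAD.match(s)), bool(NO_LOOKAROUND.match(s)))

# [A-z] spans ASCII 65-122, which includes [ \ ] ^ _ and the backtick.
print(bool(re.fullmatch(r"[A-z]+", "_^")))  # True
```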

Correct match using RegEx but it should work without substitution

I have <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] to catch everything inside
<autorpodpis>_this_is_an_example_of_what_I'd_like_to_match< If there is a space, a comma (,) or a semicolon (;), or a space before a comma or a semicolon, my RegEx matches everything, but the match also includes these characters – see my link. It works as expected.
Overall, the RegEx works fine with substitution \1 (or in AutoHotKey I use – $1). But I'd like to match without using substitution.
You seem to mix the terms substitution (regex based replacement operation) and capturing (storing a part of the matched value captured with a part of a pattern enclosed with a pair of unescaped parentheses inside a numbered or named stack).
If you want to just match a substring in specific context without capturing any subvalues, you might consider using lookarounds (lookbehind or lookahead).
In your case, since you need to match a string after some known string, you need a lookbehind. A lookbehind tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.
So, you could use
pos := RegExMatch(input, "(?<=<autorpodpis>)\p{L}+(?:\s+\p{L}+)*", Res)
So, the Res should have WOJCIECH ZAŁUSKA if you supply <autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis> as input.
Explanation:
(?<=<autorpodpis>) - check if there is <autorpodpis> right before the currently tested location. If there is none, fail this match, go on to the next location in string
\p{L}+ - 1+ Unicode letters
(?:\s+\p{L}+)* - 0+ sequences of 1+ whitespaces followed with 1+ Unicode letters.
However, in most cases, and always in cases like this one, where the lookbehind pattern is known, the lookbehind is unanchored (say, it is the first subpattern in the pattern) and you do not need overlapping matches, use capturing.
The version with capturing in place:
pos := RegExMatch(input, "<autorpodpis>(\p{L}+(?:\s+\p{L}+)*)", Res)
And then Res[1] will hold the WOJCIECH ZAŁUSKA value. Capturing is in most cases (96%) faster.
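For comparison, the same two approaches can be sketched in Python. Note the swap: Python's built-in re module has no \p{L}, so this example uses [^\W\d_] (a word character that is neither a digit nor an underscore) as an approximation. LETTER, text, lookbehind and captured are names made up for the sketch.

```python
import re

text = "<autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis>"

# Stand-in for \p{L}: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"

# Lookbehind: the tag is checked but is not part of the match.
lookbehind = re.search(rf"(?<=<autorpodpis>){LETTER}+(?:\s+{LETTER}+)*", text)
# Capturing: the tag is matched, and the name is read from group 1.
captured = re.search(rf"<autorpodpis>({LETTER}+(?:\s+{LETTER}+)*)", text)

print(lookbehind.group(0))  # WOJCIECH ZAŁUSKA
print(captured.group(1))    # WOJCIECH ZAŁUSKA
print(captured.group(0))    # <autorpodpis>WOJCIECH ZAŁUSKA
```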
Now, your regex - <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] - is not efficient as the [^;,<\n\r] also matches \s and \s matches [;,<\n\r]. My regex is linear, each subsequent subpattern does not match the previous one.