Path Validation - My RegEx is matching leading spaces in directory names and I can't fix it - regex

I'm back again with more RegEx shenanigans.
I had what I thought was a perfect Windows path validation expression.
Here it is at Regex101: https://regex101.com/r/BertHu/6
^(?:(?:[a-z]:|\\\\[a-z0-9_.$●-]+\\[a-z0-9_.$●-]+)\\|\\?[^\\\/:*?"<>|\r\n]+\\?)(?:[^\\\/:*?"<>|\r\n]+\\)*[^\\\/:*?"<>|\r\n]*$
Breakdown:
# (?:(?:[a-z]:|\\\\[a-z0-9_.$●-]+\\[a-z0-9_.$●-]+)\\| # Drive
# \\?[^\\\/:*?"<>|\r\n]+\\?) # Relative path
# (?:[^\\\/:*?"<>|\r\n]+\\)* # Folder
# [^\\\/:*?"<>|\r\n]* # File
The issue I'm having now is that the expression matches paths with leading spaces in directory names.
Example: C:\ Leading Space\ Shouldnt Match is matching.
I tried adding [^\s] to the folder portion of the expression:
(?:[^\s][^\\\/:*?"<>|\r\n]+\\)*
But that only invalidates a leading space in the first path segment:
C:\ LeadingSpace\ShouldntMatch Doesn't match (Good)
C:\LeadingSpace\ ShouldntMatch Matches incorrectly (Bad)
I think the problem lies here:
If anyone could help or point me in the right direction that would be great.
Sorry for all the RegEx questions!

Well, it depends on what the exact rules are. Taking your regex101 script as a basis, I would say:
The file, folder, and relative-path parts are more or less the same (if you ignore the non-capturing group and the backslashes):
\\?[^\\\/:*?"<>|\r\n]+\\?
(?:[^\\\/:*?"<>|\r\n]+\\)*
[^\\\/:*?"<>|\r\n]*
So there are three potential places where a folder could start with a leading space.
You could add a [^\s] in front of each of them, like this:
^(?:(?:[a-z]:|\\\\[a-z0-9_.$●-]+\\[a-z0-9_.$●-]+)\\|\\?([^\s][^\\\/:*?"<>|\r\n])+\\?)(?:[^\s][^\\\/:*?"<>|\r\n]+\\)*([^\s][^\\\/:*?"<>|\r\n])*$
I saved the modified regex on regex101: https://regex101.com/r/Pd3lcR/1
Now it should work, at least for my limited test cases and what I know about the restrictions.
By the way, I don't know what your use case is, but this regex is pretty long for simple matching and filename capture; maybe there is a more readable way (for non-regex people).
Update:
To fix the bug this introduced, I have to stop the share option from matching as a relative path, by forbidding a double backslash with (?!\\):
^(?:(?:[a-z]:|\\\\[a-z0-9_.$●-]+\\[a-z0-9_.$●-]+)\\|\\?((?!\\)[^\\\/:*?"<>|\r\n])+\\?)(?:[^\s][^\\\/:*?"<>|\r\n]+\\)*([^\s][^\\\/:*?"<>|\r\n])*$
Here is the updated regex: https://regex101.com/r/RMVkTC/3
Update (Version 2):
I rewrote the regex the way I would build it. It is not perfectly optimized (short), but this way it is easier to test and fix.
The regex consists of exactly three parts, joined with |:
Drive + path + folder/file: (^[a-z]:\\([^\s][^\\\/:*?"<>|\r\n]+\\)*[^\s][^\\\/:*?"<>|\r\n]+$)
relativepath + folder/file: (^(\.?\.?\\[^\\])*([^\\\/:*?"<>|\r\n]+\\?)*$)
Share + folder/file: (^\\\\[a-z0-9_.$●-]+\\[a-z0-9_.$●-]+\\([^\s][^\\\/:*?"<>|\r\n]*\\?)*$)
This way, if you have to change something for one edge case, the change is more contained and easier to adapt.
Here is the updated regex: https://regex101.com/r/Qxj3Ni/1
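The drive branch from Version 2 can be sanity-checked outside regex101. This is a minimal Python sketch, assuming case-insensitive matching (the pattern only lists lowercase drive letters). The names DRIVE_PATH and is_valid and the sample paths are illustrative, not part of the original answer:

```python
import re

# Drive + path + folder/file branch, copied from the answer above.
# re.IGNORECASE is an assumption so that "C:" matches [a-z]:.
DRIVE_PATH = re.compile(
    r'^[a-z]:\\([^\s][^\\\/:*?"<>|\r\n]+\\)*[^\s][^\\\/:*?"<>|\r\n]+$',
    re.IGNORECASE,
)

def is_valid(path):
    """Return True if the whole path matches the drive branch."""
    return DRIVE_PATH.match(path) is not None

samples = [
    r'C:\Folder\Sub Folder\file.txt',      # spaces inside a name are fine
    r'C:\ Leading Space\ Shouldnt Match',  # leading space in the first segment
    r'C:\LeadingSpace\ ShouldntMatch',     # leading space in a later segment
    r'C:\a\b.txt',                         # single-character folder name
]
for path in samples:
    print(path, '->', is_valid(path))
```

One thing this makes visible: [^\s] followed by a class with + needs at least two characters per segment, so a path with a single-character folder such as C:\a\b.txt is rejected by this branch as written.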

Related

Is it possible to remove the slash in this matching?

I want to extend my regexp for file path matching, and I don't know how to do it even though I can see the problem.
Input example
"C://species/dinosaurs/trex.json"
Output example
["C://species/dinosaurs" "trex" "json"]
so that I have the folder path, the filename and the extension.
I also want the folder path to be optional
My regexp
I tried
"^(.*[\\\/])?(.*)\.(.*)$"
It outputs
["C://species/dinosaurs/" "trex" "json"]
Almost, but I have the / at the end of the folder path.
So I tried
"^((.*)[\\\/])?(.*)\.(.*)$"
It outputs
["C://species/dinosaurs/" "C://species/dinosaurs" "trex" "json"]
Maybe that's better, because I just have to drop the first match, whereas in the first case I have to post-process the string.
I see the problem: several / can occur in the path, which makes it harder.
Is it possible to say that the first matching group can end with anything but /?
I tried
^(.*(?!\/))[\\\/]?(.*)\.(.*)$
Does not work. I just discovered negative assertions but the output is
["C://species/dinosaurs/trex" "json"]
Any clue?
This one should suit your needs:
^(?:(.*)/)?([^/]+)\.([^.]+)$
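To see what the three groups capture, here is a small Python sketch using the pattern above unchanged. The names PATH_PARTS and split_path are made up for illustration:

```python
import re

# Pattern from the answer above: optional folder, file name, extension.
PATH_PARTS = re.compile(r'^(?:(.*)/)?([^/]+)\.([^.]+)$')

def split_path(path):
    """Return (folder, name, extension); folder is None when absent."""
    m = PATH_PARTS.match(path)
    return m.groups() if m else None

print(split_path('C://species/dinosaurs/trex.json'))
print(split_path('trex.json'))
```

The slash sits inside the non-capturing group but outside the capturing one, which is why the folder comes back without its trailing /.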

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (e.g. http://www.link.com/folder/file.html) from a document with Notepad++. Actually, I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will give you a list of all your links. There are two issues though:
- The regex you provided for matching URLs is far from generic enough to match any URL. If it works in your case, that's fine; otherwise check this question.
- It will leave the text after the last matched URL intact. You have to delete it manually.
The earlier answer by @psxls was a great help when I wanted to perform a similar process.
However, that regex was written six years ago, so I had to adjust it to work properly with more recent links, because:
- a lot of URLs now use HTTPS instead of HTTP
- many websites no longer use www as their main subdomain
- some links contain punctuation marks (which have to be preserved)
I finally changed the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in Notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I appended a ? hoping to make it lazy, but grep's basic syntax has no lazy quantifiers: in GNU grep, \+\? only makes the repetition optional, and true lazy matching needs grep -P with +?. The way you have your URL match formulated, greediness probably won't cause problems, but if you change your match to something broader than [a-zA-Z0-9./-], you could see URLs on the same line getting combined together.
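For readers without grep at hand, the same "print only the matches" idea can be sketched in Python. This is an illustration rather than part of the original answer. The sample text and the names URL and extract_urls are made up, and Python's re uses unescaped + and ? where grep's basic syntax needs backslashes:

```python
import re

# Same character class as the second grep example, in Python's re syntax.
URL = re.compile(r'https?://[a-zA-Z0-9./-]+')

def extract_urls(text):
    """Return every URL-like match in text, similar to grep -o."""
    return URL.findall(text)

sample = ('foo http://www.link.com/folder/file.html bar '
          'https://example.org/a-b/c.html baz')
for url in extract_urls(sample):
    print(url)
```

Because the character class excludes spaces, the greedy + stops at the end of each URL here, so two links on one line stay separate.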
I did this a different way.
Find everything up to the first/next (https or http) (then everything that comes next) up to (html or htm), then output just the '(https or http)(everything next) then (html or htm)' with a line feed/ carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: \1\2\3\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
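The find/replace pair for html/htm links can be tried outside Notepad++ as well. This is a rough Python sketch. re.DOTALL is an assumption standing in for Notepad++'s ". matches newline" option, \n replaces \r\n, and the sample text and the names FIND and list_links are invented:

```python
import re

# Find pattern from the answer; DOTALL lets ".*?" run across line breaks.
FIND = re.compile(r'.*?(https:|http:)(.*?)(html|htm)', re.DOTALL)

def list_links(text):
    """Replace each match with just the link followed by a newline."""
    return FIND.sub(r'\1\2\3\n', text)

sample = 'aaa http://x.example/a.html bbb\nccc https://y.example/b.htm ddd'
result = list_links(sample)
print(result)
```

The text after the last link (" ddd" in this sample) is left in place, which matches the note about removing trailing text by hand.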
I know my answer won't be RegEx related, but here is another efficient way to get lines containing URLs.
This won't remove text around links like Toto mentioned in comments.
At least if there is nice pattern to all links, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Regexp-replace: Multiple replacements within a match

I'm converting our MVC3 project to use T4MVC, and I would like to change the JavaScript includes to work with T4MVC as well. So I need to replace
"~/Scripts/DataTables/TableTools/TableTools.min.js"
"~/Scripts/jquery-ui-1.8.24.min.js"
Into
Scripts.DataTables.TableTools.TableTools_min_js
Scripts.jquery_ui_1_8_24_min_js
I'm using Notepad++ as a regexp tool at the moment, and it is using POSIX regexps.
I can find script name and replace it with these regexps:
Find: \("~/Scripts/(.*)"\)
Replace with \(Scripts.\1\)
But I can't figure out how to replace dots and dashes in the file names with underscores, and forward slashes with dots.
I can check that the js file name has a dot or dash in it with this:
\("~/Scripts/(?=\.*)(?=\-*).*"\)
But how do I replace groups within a group?
I need a non-greedy replacement within the group, and the replacements must run in order, so that forward slashes converted into dots are not converted into underscores afterwards.
This is a non-critical problem, I've already done all the replacements manually, but I thought I'm good with regexp, so this problem bugs me!!
p.s. preferred tool is Notepad++, but any POSIX regexp solution would do -)
p.p.s. Here you can get a sample of stuff to be replaced
And here is the target text
I would just use a site like RegexHero
You can paste the code into the target string box, then place (?<=(~/Script).*)[.-](?=(.*"[)]")) into the Regular Expression box, with _ in the Replacement String box.
Once the replace is done, click on Final String at the bottom, and select Move to target string and start a new expression.
From there, Paste (?<=(<script).*)("~/)(?=(.*[)]" ))|(?<=(Url.).*)(")(?=(.*(\)" ))) into the Regular Expression box and leave the Replacement String box empty.
Once the replace is done, click on Final String at the bottom, and select Move to target string and start a new expression.
From there paste (?<=(Script).*)[/](?=(.*[)]")) into the Regular Expression box and . into the Replacement String box.
After that, the Final String box will have what you are looking for. I'm not sure the upper limits of how much text you can parse, but it could be broken up if that's an issue. I'm sure there might be better ways to do it, but this tends to be the way I go about things like this. One reason I like this site, is because I don't have to install anything, so I can do it anywhere quickly.
Edit 1: Per the comments, I have moved step 3 to Step 5 and added new steps 3 and 4. I had to do it this way, because new Step 5 would have replaced the / in "~/Scripts with a ., breaking the removal of "~/. I also had to change Step 5's code to account for the changed beginning of Script
Here is a vanilla Notepad++ solution, but it's certainly not the most elegant one. I managed to do the transformation with several passes over the file.
First pass
Replace . and - with _.
Find: ("~/Scripts[^"]*?)[.-]
Replace With: \1_
Unfortunately, I could not find a way to match only the . or -, because it would require a lookbehind, which is apparently not supported by Notepad++. Due to this, every time you execute the replacement only the first . or - in a script name will be replaced (because matches cannot overlap). Hence, you have to run this replacement multiple times until no more replacements are done (in your example input, that would be 8 times).
Second pass
Replace / with ..
Find: ("~/Scripts[^"]*?)/
Replace with: \1.
This is basically the same thing as the first pass, just with different characters (you will have to do this 3 times for the example file). Doing the passes in this order ensures that no slashes will end up as underscores.
Third pass
Remove the surrounding characters.
Find: "~/(Scripts[^"]*?)"
Replace with: \1
This will now match all the script names that are still surrounded by "~/ and ", capturing what is in between and just outputting that.
Note that by including those surrounding characters in the find patterns of the first two passes, you can avoid converting the . in strings that are already of the new format.
As I said this is not the most convenient way to do it. Especially, since passes one and two have to be executed manually multiple times. But it would still save a lot of time for large files, and I cannot think of a way to get all of them - only in the correct strings - in one pass, without lookbehind capabilities. Of course, I would very much welcome suggestions to improve this solution :). I hope I could at least give you (and anyone with a similar problem) a starting point.
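The "repeat each pass until nothing changes" procedure can be simulated in Python to check the three patterns. This is only a sketch of the manual steps, not a Notepad++ feature. The function name convert is made up, and re.subn is used because it reports how many replacements were made:

```python
import re

def convert(text):
    """Apply the three passes, repeating passes one and two until stable."""
    # First pass: one '.' or '-' per script name is replaced on each run.
    while True:
        text, count = re.subn(r'("~/Scripts[^"]*?)[.-]', r'\1_', text)
        if count == 0:
            break
    # Second pass: one '/' per script name is replaced on each run.
    while True:
        text, count = re.subn(r'("~/Scripts[^"]*?)/', r'\1.', text)
        if count == 0:
            break
    # Third pass: drop the surrounding "~/ and ".
    return re.sub(r'"~/(Scripts[^"]*?)"', r'\1', text)

print(convert('"~/Scripts/DataTables/TableTools/TableTools.min.js"'))
print(convert('"~/Scripts/jquery-ui-1.8.24.min.js"'))
```

Running the dot/dash pass to completion before the slash pass is what keeps the dots produced from slashes from turning into underscores.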
If, as your question indicates, you'd like to use N++ then use N++ Python Script. Setup the script and assign a shortcut key, then you have a single pass solution requiring only to open, modify, and save... can't get much simpler than that.
I think part of the problem is that N++ is not a regex tool, and the use of a dedicated regex tool, or even a search/replace solution, is sometimes warranted. You may be better off, both in speed and in time value, using a tool made for text processing rather than editing.
[Script Edit]:: Altered to match the modified in/out expectations.
# Substitute & Replace within matched group.
from Npp import *
import re
def repl(m):
    return "(Scripts." + re.sub( "[-.]", "_", m.group(1) ).replace( "/", "." ) + ")"
editor.pyreplace( '(?:[(].*?Scripts.)(.*?)(?:"?[)])', repl )
Install:: Plugins -> Plugin Manager -> Python Script
New Script:: Plugins -> Python Script -> script-name.py
Select target tab.
Run:: Plugins -> Python Script -> Scripts -> script-name
[Edit: An extended one-liner PythonScript command]
Having need for the new regex module for Python (that I hope replaces re) I played around and compiled it for use with the N++ PythonScript plugin and decided to test it on your sample set.
Two commands on the console ended up with the correct results in the editor.
import regex as re
editor.setText( (re.compile( r'(?<=.*Content[(].*)((?<omit>["~]+?([~])[/]|["])|(?<toUnderscore>[-.]+)|(?<toDot>[/]+))+(?=.*[)]".*)' ) ).sub(lambda m: {'omit':'','toDot':'.','toUnderscore':'_'}[[ key for key, value in m.groupdict().items() if value != None ][0]], editor.getText() ) )
Very sweet!
What else is really cool about using regex instead of re was that I was able to build the expression in Expresso and use it as is! Which allows for a verbose explanation of it, just by copy-paste of the r'' string portion into Expresso.
The abbreviated text of which is::
Match a prefix but exclude it from the capture. [.*Content[(].*]
[1]: A numbered capture group. [(?<omit>["~]+?([~])[/]|["])|(?<toUnderscore>[-.]+)|(?<toDot>[/]+)], one or more repetitions
Select from 3 alternatives
[omit]: A named capture group. [["~]+?([~])[/]|["]]
Select from 2 alternatives
["~]+?([~])[/]
Any character in this class: ["]
[toUnderscore]: A named capture group. [[-.]+]
[toDot]: A named capture group. [[/]+]
Match a suffix but exclude it from the capture. [.*[)]".*]
The command breakdown is fairly nifty, we are telling Scintilla to set the full buffer contents to the results of a compiled regex substitution command by essentially using a 'switch' off of the name of the group that isn't empty.
Hopefully Dave (the PythonScript Author) will add the regex module to the ExtraPythonLibs part of the project.
Alternatively you could use a script that would do it and avoid copy pasting and the rest of the manual labor altogether. Consider using the following script:
$_.gsub!(%r{(?:"~/)?Scripts/([a-z0-9./-]+)"?}i) do |i|
  'Scripts.' + $1.split('/').map { |i| i.gsub(/[.-]/, '_') }.join('.')
end
And run it like this:
$ ruby -pi.bak script.rb *.ext
All the files with extension .ext will be edited in-place and the original files will be saved with .ext.bak extension. If you use revision control (and you should) then you can easily review changes with some visual diff tool, correct them if necessary and commit them afterwards.

Regex for all files except .hg_keep

I use empty .hg_keep files to keep some (otherwise empty) folders in Mercurial.
The problem is that I can't find a working regex which excludes everything but the .hg_keep files.
Let's say we have this file structure:
a/b/c2/.hg_keep
a/b/c/d/.hg_keep
a/b/c/d/file1
a/b/c/d2/.hg_keep
a/b/.hg_keep
a/b/file2
a/b/file1
a/.hg_keep
a/file2
a/file1
I want to keep only the .hg_keep files under a/b/.
With the help of http://gskinner.com/RegExr/ I created the following .hgignore:
syntax: regexp
.*b.*/(?!.*\.hg_keep)
But Mercurial ignores all .hg_keep files in subfolders of b.
# hg status
? .hgignore
? a/.hg_keep
? a/b/.hg_keep
? a/file1
? a/file
# hg status -i
I a/b/c/d/.hg_keep
I a/b/c/d/file1
I a/b/c/d2/.hg_keep
I a/b/c2/.hg_keep
I a/b/file1
I a/b/file2
I know that I can hg add all the .hg_keep files, but is there a solution with a regular expression (or glob)?
Regexp negation might work for this. If you want to ignore everything except the a/b/.hg_keep file, you can probably use:
^(?!a/b/\.hg_keep$)
The parts of this regexp that matter are:
^ anchor the match to the beginning of the file path
(?! ... ) negation of the expression between '!' and ')'
a/b/\.hg_keep the full path of the file you want to match
$ anchor the match to the end of the file path; it has to sit inside the lookahead, because ^(?! ... )$ could only ever match an empty path
The regular expression
^a/b/\.hg_keep$
would match only the file called a/b/.hg_keep.
Its negation
^(?!a/b/\.hg_keep$)
will match everything else.
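A quick Python check of the negation idea, treated as a plain regex outside Mercurial (how Mercurial applies ignore patterns to directories is not modelled here). It contrasts keeping $ inside the lookahead with placing it after the closing parenthesis. The path list and variable names are illustrative:

```python
import re

paths = ['a/b/.hg_keep', 'a/b/file1', 'a/.hg_keep', 'a/b/c/d/.hg_keep']

# "$" inside the lookahead: rejects exactly one path, accepts the rest.
inside = re.compile(r'^(?!a/b/\.hg_keep$)')
# "$" after the lookahead: "^" and "$" must meet, so only "" can match.
outside = re.compile(r'^(?!a/b/\.hg_keep)$')

matched_inside = [p for p in paths if inside.search(p)]
matched_outside = [p for p in paths if outside.search(p)]
print(matched_inside)
print(matched_outside)
```

With $ outside the lookahead, none of the sample paths match, so the position of the anchor decides whether the pattern is a usable negation at all.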
Not quite sure in what context you are using the Regex but this should be it, this matches all lines ending in .hg_keep:
^.*\.hg_keep$
EDIT: And here is a Regex to match items not matching the above expression:
^(?:(?!.*\.hg_keep).)*$
Try (?!.*/\.hg_keep$).
I was looking for something similar to this.
Found an answer, but it's not what we want to hear.
Limitations
There is no straightforward way to ignore all but a set of files. Attempting to use an inverted regex match will fail when combined with other patterns. This is an intentional limitation, as alternate formats were all considered far too likely to confuse users to be worth the additional flexibility.
Ref: https://www.mercurial-scm.org/wiki/.hgignore

hgignore: help ignoring all files but certain ones

I need an .hgdontignore file :-) to include certain files and exclude everything else in a directory. Basically I want to include only the .jar files in a particular directory and nothing else. How can I do this? I'm not that skilled in regular expression syntax. Or can I do it with glob syntax? (I prefer that for readability)
Just as an example location, let's say I want to exclude all files under foo/bar/ except for foo/bar/*.jar.
The answer from Michael is a fine one, but another option is to just exclude:
foo/bar/**
and then manually add the .jar files. You can always add files that are excluded by an ignore rule and it overrides the ignore. You just have to remember to add any jars you create in the future.
To do this, you'll need to use this regular expression:
foo/bar/.+?\.(?!jar).+
Explanation
You are telling it what to ignore, so this expression is searching for things you don't want.
You look for any file whose name (including relative directory) includes (foo/bar/)
You then look for any characters that precede a period ( .+?\. == match one or more characters of any kind until you reach the period character)
You then make sure it doesn't have the "jar" ending with (?!jar) (this is called a negative lookahead)
Finally you grab the ending it does have (.+)
Regular expressions are easy to mess up, so I strongly suggest that you get a tool like Regex Buddy to help you build them. It will break down a regex into plain English which really helps.
EDIT
Hey Jason S, you caught me, it does miss those files.
This corrected regex will work for every example you listed:
foo/bar/(?!.*\.jar$).+
It finds:
foo/bar/baz.txt
foo/bar/baz
foo/bar/jar
foo/bar/baz.jar.txt
foo/bar/baz.jar.
foo/bar/baz.
foo/bar/baz.txt.
But does not find
foo/bar/baz.jar
New Explanation
This says look for files in "foo/bar/" , then do not match if there are zero or more characters followed by ".jar" and then no more characters ($ means end of the line), then, if that isn't the case, match any following characters.
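The two patterns can be compared on the file names listed above with a short Python sketch. This only exercises the regexes with Python's re module and says nothing about how Mercurial itself walks directories. The variable names are made up:

```python
import re

first_try = re.compile(r'foo/bar/.+?\.(?!jar).+')
corrected = re.compile(r'foo/bar/(?!.*\.jar$).+')

files = [
    'foo/bar/baz.txt', 'foo/bar/baz', 'foo/bar/jar', 'foo/bar/baz.jar.txt',
    'foo/bar/baz.jar.', 'foo/bar/baz.', 'foo/bar/baz.txt.', 'foo/bar/baz.jar',
]

# Files the corrected pattern would ignore.
ignored = [f for f in files if corrected.search(f)]
# Files that should be ignored but slip past the first attempt.
missed_by_first = [f for f in ignored if not first_try.search(f)]

print(ignored)
print(missed_by_first)
```

The first attempt depends on a period appearing in the name with something after it, so names without a dot or ending in one slip through, while the corrected pattern only tests whether the whole path ends in .jar.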
Anyone who wants to use negative lookaheads (?! in regex syntax) or any kind of back-referencing mechanism should be aware that Mercurial will fall back from Google's RE2 to Python's re module for matching.
RE2 is a non-backtracking engine that guarantees a run time linear in the size of the input. If performance is important to you, that is, if you have a big repository, you should consider sticking to simpler patterns that RE2 supports, which is why I think that the solution offered by Ryan.