How to edit "Full Windows Folder Path Regular Expression" - regex

Hi, this regular expression works fine for a full Windows folder path:
^([A-Za-z]:|\\{2}([-\w]+|((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\\(([^"*/:?|<>\\,;[\]+=.\x00-\x20]|\.[.\x20]*[^"*/:?|<>\\,;[\]+=.\x00-\x20])([^"*/:?|<>\\,;[\]+=\x00-\x1F]*[^"*/:?|<>\\,;[\]+=\x00-\x20])?))\\([^"*/:?|<>\\.\x00-\x20]([^"*/:?|<>\\\x00-\x1F]*[^"*/:?|<>\\.\x00-\x20])?\\)*$
Matches
d:\, \\Dpk\T c\, E:\reference\h101\, \\be\projects$\Wield\Rff\, \\70.60.44.88\T d\SPC2\
Non-Matches
j:ohn\, \\Dpk\, G:\GD, \\cae\.. ..\, \\70.60.44\T d\SPC2\
Problem:
This expression requires a "\" at the end of the path.
How can I edit this expression so the user can enter paths like
C:\Folder1, C:\Folder 1\Sub Folder

There are two ways to approach this problem:
Understand the regex (way harder than necessary) and fix it to your specification (may be buggy)
Don't worry about how the regex does its thing (it seems to do what you need) and modify your input to conform to what the regex expects
The second approach means that you just check whether the input string ends with \. If it doesn't, add it on, then let the regex do its magic.
I normally wouldn't recommend this ignorant alternative, but this may be an exception.
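The second approach can be sketched in a few lines. This is a minimal Python sketch, assuming the pattern from the question is copied verbatim and that Python's re module reads it the same way as the original engine; the helper name is_valid_folder_path is made up for illustration.

```python
import re

# Pattern copied verbatim from the question (split over several raw strings
# for readability only). It insists on a trailing backslash.
FOLDER_PATH = re.compile(
    r'^([A-Za-z]:|\\{2}([-\w]+|((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))'
    r'\\(([^"*/:?|<>\\,;[\]+=.\x00-\x20]|\.[.\x20]*[^"*/:?|<>\\,;[\]+=.\x00-\x20])'
    r'([^"*/:?|<>\\,;[\]+=\x00-\x1F]*[^"*/:?|<>\\,;[\]+=\x00-\x20])?))'
    r'\\([^"*/:?|<>\\.\x00-\x20]([^"*/:?|<>\\\x00-\x1F]*[^"*/:?|<>\\.\x00-\x20])?\\)*$'
)


def is_valid_folder_path(path: str) -> bool:
    """Hypothetical helper: add the missing trailing backslash, then ask the regex."""
    if not path.endswith("\\"):
        path += "\\"
    return FOLDER_PATH.match(path) is not None


for candidate in ("C:\\Folder1", "C:\\Folder 1\\Sub Folder", "j:ohn"):
    print(candidate, is_valid_folder_path(candidate))
```

Note that G:\GD, listed as a non-match in the question, is accepted once the backslash is appended, which is exactly the behavior being asked for.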
Blackboxing
Here's how I'm "solving" this problem:
There's a magic box; who knows how it works, but it does 99% of the time
We want it to work 100% of the time
It's simpler to fix the 1% so it works with the magic box rather than fixing the magic box itself (because this would require understanding of how the magic box works)
Then just fix the 1% manually and leave the magic box alone
Deciphering the black magic
That said, we can certainly try to take a look at the regex. Here's the same pattern but reformatted in free-spacing/comments mode, i.e. (?x) in e.g. Java.
^
(  [A-Za-z]:
|  \\{2}  (  [-\w]+
          |  (
               (25[0-5]
               |2[0-4][0-9]
               |[01]?[0-9][0-9]?
               )\.
             ){3}
             (25[0-5]
             |2[0-4][0-9]
             |[01]?[0-9][0-9]?
             )
          )
          \\ (
               (  [^"*/:?|<>\\,;[\]+=.\x00-\x20]
               |  \.[.\x20]* [^"*/:?|<>\\,;[\]+=.\x00-\x20]
               )
               (  [^"*/:?|<>\\,;[\]+=\x00-\x1F]*
                  [^"*/:?|<>\\,;[\]+=\x00-\x20]
               )?
             )
)
\\ (
     [^"*/:?|<>\\.\x00-\x20]
     (
       [^"*/:?|<>\\\x00-\x1F]*
       [^"*/:?|<>\\.\x00-\x20]
     )?
     \\
   )*
$
The main skeleton of the pattern is as follows:
^
(head)
\\ (
bodypart
\\
)*
$
Based on this higher-level view, it looks like an optional trailing \ can be supported by adding ? to the two \\ following the (head) part:
^
(head)
\\?(
bodypart
\\?
)*
$
References
regular-expressions.info/Question Mark for Optional
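One thing worth checking before adopting this: with the first \\ optional, the drive prefix no longer has to be followed by a backslash. The sketch below uses simplified stand-ins for head and bodypart (assumptions for illustration, not the full character classes from the question) and compares that skeleton with a hypothetical variant that keeps the root \\ mandatory and makes only the final one optional.

```python
import re

# Simplified stand-ins, NOT the full classes from the question.
HEAD = r"[A-Za-z]:"
BODY = r'[^\\/:*?"<>|]+'

# Skeleton with both separators made optional.
both_optional = re.compile(rf"^{HEAD}\\?(?:{BODY}\\?)*$")

# Hypothetical alternative: root backslash stays mandatory, parts are
# separated by exactly one backslash, only the last backslash is optional.
last_optional = re.compile(rf"^{HEAD}\\(?:{BODY}(?:\\{BODY})*\\?)?$")

for path in ("C:\\Folder1", "C:\\Folder 1\\Sub Folder\\", "j:ohn\\"):
    print(path, bool(both_optional.match(path)), bool(last_optional.match(path)))
```

With these stand-ins, j:ohn\ (a non-match in the question) is accepted by the first pattern and rejected by the second, while C:\Folder1 is accepted by both.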
Note on catastrophic backtracking
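You should generally be very wary of nesting repetition modifiers (a ? inside a * in this case). The bodypart doesn't match \, which is what keeps the original pattern unambiguous; but once the \\ separator is optional, a run of characters can be split into consecutive bodyparts in many different ways, so a long input that ultimately fails to match can backtrack heavily. Test with such inputs before relying on it.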
References
regular-expressions.info/Catastrophic Backtracking

I don't understand your regular expression at all. But I bet all you need to do is find the bit or bits that match the trailing "\", and add a single question mark after that bit or those bits.

The regex you provided seems to mismatch "C:\?tmp" which is an invalid windows path.
I have figured out one solution, but it works on Windows only. You may want to try this one:
"^[A-Za-z]:(?:\\\\(?![\"*/:?|<>\\\\,;[\\]+=.\\x00-\\x20])[^\"*/:?|<>\\\\[\\]]+){0,}(?:\\\\)?$"
This regex ignores the last "\" which hinders you.
I've tested with pcre.lib(5.5) in VS2005.
Hope it helps!

I know this question is roughly 4 years old, but the following may be sufficient:
string validWindowsOrUncPath = @"^(?:(?:[a-z]:)|(?:\\\\[^\\*\?\:;\0]*))(?:\\[^\\*\?\:;\0]*)+$";
(to be used with IgnoreCase option).
Edit:
I even came to this one, which can extract the root and each part in named groups:
string validWindowsOrUncPath = @"^(?<Root>(?:[a-z]:)|(?:\\\\[^\\*\?\:;\0]*))(?:\\(?<Part>[^\\*\?\:;\0]*))+$";

Related

RegEx - is recursive substitution possible using only a RegEx engine? Conditional search replace

I'm editing some data, and my end goal is to conditionally substitute , (comma) chars with .(dot). I have a crude solution working now, so this question is strictly for suggestions on better methods in practice, and determining what is possible with a regex engine outside of an enhanced programming environment.
I gave it a good college try, but 6 hours is enough mental grind for a Saturday, and I'm throwing in the towel. :)
I've been through about 40 SO posts on regex recursion, substitution, etc, the wiki.org on the definitions and history of regex and regular language, and a few other tutorial sites. The majority is centered around Python and PHP.
The working, crude regex (facilitating loops / search and replace by hand):
(^.*)(?<=\()(.*?)(,)(.*)(?=\))(.*$)
A snip of the input:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
room_ass=01:macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*,4,6,8,),
room_ass=01:macro_id=03: name=All, pgm_audio=1, list=(1,2*,3,4,5,6,7,8,),
And the desired output:
room_ass=01: macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*.3.5.7.),
room_ass=01: macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*.4.6.8.),
room_ass=01: macro_id=03: name=All, pgm_audio=1, list=(1.2*.3.4.5.6.7.8.),
That's all. Just replace the , with ., but only inside ( ).
This is one conceptual (not working) method I'd like to see, where the middle group<3> would loop recursively:
(^.*)(?<=\()([^,]*)([,|\d|\*]\3.*)(?=\))(.*$)
( ^ )
..where each recursive iteration would shift across the data, either 1 char or 1 comma at a time:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
iter 1-| ^ |
2-| ^ |
3-| ^ |
4-| ^|
or
A much simpler approach would be to just tell it to mask/select all , between the (), but I struck out on figuring that one out.
I use text editors a lot for little data editing tasks like this, so I'd like to verify that SublimeText can't do it before I dig into Python.
All suggestions and criticisms welcome. Be gentle. <--#n00b
Thanks in advance!
-B
Not much magic needed. Just check whether there's a closing ) ahead, without any ( in between.
,(?=[^)(]*\))
See this demo at regex101
However, it does not check for an opening (. It's a common approach and probably a duplicate.
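For anyone who does end up in Python, the same lookahead can be tried with re.sub. This is a small sketch using one of the input lines from the question; the function name dots_in_parens is made up for illustration.

```python
import re

# A comma that is followed by a ")" with no "(" or ")" in between.
COMMA_IN_PARENS = re.compile(r",(?=[^)(]*\))")


def dots_in_parens(text: str) -> str:
    """Replace every comma that sits before a closing parenthesis with a dot."""
    return COMMA_IN_PARENS.sub(".", text)


line = "room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),"
print(dots_in_parens(line))
# room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*.3.5.7.),
```

The commas outside the parentheses are left alone because the lookahead runs into a ( or the end of the text before it finds a ).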
This is a complete guess because I don't use SublimeText; the assumption here is that SublimeText uses PCRE regular expressions.
You mention "recursive", but I don't believe you mean regular expression recursion, which doesn't fit the problem here.
Something like this might work...
You'll need to test to make sure this isn't matching other things in your document and to see if SublimeText even supports this...
This is based on using the \K operator to "keep" what comes before it out of the match - you can find other uses of it as a PCRE alternative (workaround) to variable-length look-behinds not being supported by PCRE.
Regular Expression
\((?:(?:[^,\)]+),)*?(?:[^,\)]+)\K,
Regex Description
Match the opening parenthesis character \(
Match the regular expression below (?:(?:[^,\)]+),)*?
    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
    Match the regular expression below (?:[^,\)]+)
        Match any single character NOT present in the list below [^,\)]+
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
            The literal character “,” ,
            The closing parenthesis character \)
    Match the character “,” literally ,
Match the regular expression below (?:[^,\)]+)
    Match any single character NOT present in the list below [^,\)]+
        Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
        The literal character “,” ,
        The closing parenthesis character \)
Keep the text matched so far out of the overall regex match \K
Match the character “,” literally ,

Powershell script to search, split and join in one line

Been racking my Friday brain on a regex problem with dealing with Sql Server object names.
An input to my Powershell script is a procedure name. The name can take many forms, such as
dbo.Procedure
[dbo].Procedure
dbo.[Procedure.Name]
etc
So far I'd come up with the following to split the value into its constituent parts:
[string[]] $procNameA = $procedure.Split("(?:\.)(?=(?:[^\]]|\[[^\]]*\])*$)")
In addition I have a regex that I could use to handle the square brackets
(?:\[)*([A-Za-z0-9. !]+)(?:\])*
And this is about as far as my limited regex experience will take me.
Now granted I could deal with a lot of this by treating each element in a ForEach and doing a RegEx replace there, but y'know that just seems so, I dunno, ungainly. So, question I have for any passing Powershell & RegEx guru: "How can I do all this in one line?"
What I'm looking for is a way to get the following results:
Original                 Corrected
=====================    =====================
dbo.ProcName             [dbo].[ProcName]
dbo.[ProcName]           [dbo].[ProcName]
[dbo].ProcName           [dbo].[ProcName]
[dbo].[ProcName]         [dbo].[ProcName]
[My.Schema].[My.Proc]    [My.Schema].[My.Proc]
[My.Schema].ProcName     [MySchema].[ProcName]
dbo.[ABadBADName!        [dbo].[[ABadBADName!]
(Notice the last instance, where an object name starts but does not end with a square bracket. Not that I'm expecting that [and if I saw anyone on my team naming an object like that I'd be asking HR if I can fire them for it], but I do like to be thorough.)
Think that covers everything...
So, over to you Powershell & RegEx gurus - how do I do this?
Please limit any answers to FULLY answering the question with code I can actually use and not just syntax suggestions.
Clarification: I am acutely aware that sometimes 'slow and steady wins the race' may apply here and that support wise it would be potentially safer to handle the rest in a ForEach, but that's not the point. Part of this is to help me understand just how flexible RegEx can be, so this is more of an educational exercise rather than a philosophical one.
Okay how about this:
@'
dbo.ProcName
dbo.[ProcName]
[dbo].ProcName
[dbo].[ProcName]
[My.Schema].[My.Proc]
[My.Schema].ProcName
dbo.[ABadBADName!
'@ -split '\s*\r?\n\s*' | % {
    $_ -replace '^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$', '[${schema}].[${proc}]'
}
Note that I'm only using ForEach-Object (%) here to iterate through your test cases; the actual replace is done with a single regex / replace.
Explanation
So the important part here is the regex:
^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$
Breaking it down:
^ -- match the beginning of the string
(?: -- open a non-capturing group (for alternation purposes)
    \[ -- match a literal left bracket [
    (?<schema> -- start a named capture group, with the name schema
        [^\]]+ -- match 1 or more of any character that is not a literal right square bracket ]
    ) -- end the schema capture group
    \] -- match a literal right bracket ]
    | -- alternation; if the previous expression didn't match, try what comes after this
    (?<schema> -- again start a named capture group called schema; this is only tried if the other one didn't match.
        [^\.]+ -- match 1 or more of any character that is not a literal dot .
    ) -- end the alternate schema capture group
) -- end the non-capturing group
\. -- match a literal dot . (this is the one separating schema and proc)
(the next part for proc is exactly the same steps as above, with a different name for the capturing group)
$ -- match the end of the string
In the replace, we just qualify the names of the groups with ${name} syntax instead of the numbers $1 (which would work too actually).
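For comparison, here is a rough Python port of the same idea. It is only a sketch: Python's re module rejects a pattern that defines the same group name twice, so numbered groups and a replacement function stand in for the repeated schema/proc names. The name bracket_name is made up for illustration.

```python
import re

# Groups 1/2 are the bracketed/unbracketed schema, 3/4 the same for the proc.
NAME = re.compile(r"^(?:\[([^\]]+)\]|([^.]+))\.(?:\[([^\]]+)\]|([^.]+))$")


def bracket_name(name: str) -> str:
    """Wrap both parts of a two-part object name in square brackets."""
    def repl(match: re.Match) -> str:
        schema = match.group(1) or match.group(2)
        proc = match.group(3) or match.group(4)
        return f"[{schema}].[{proc}]"

    return NAME.sub(repl, name)


for raw in ("dbo.ProcName", "[My.Schema].[My.Proc]", "dbo.[ABadBADName!"):
    print(raw, "->", bracket_name(raw))
```

Names that do not fit the two-part shape are returned unchanged, because the anchored pattern simply does not match them.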

PHP preg_match_all trouble

I have written a regular expression that I tested in rubular.com and it returned 4 matches. The subject of testing can be found here http://pastebin.com/49ERrzJN and the PHP code is below. For some reason the PHP code returns only the first 2 matches. How to make it to match all 4? It seems it has something to do with greediness or so.
$file = file_get_contents('x.txt');
preg_match_all('~[0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})~', $file, $m, PREG_SET_ORDER);
foreach($m as $v) echo 'S: '. $v[1]. '; N: '. $v[3]. '; D:'. $v[7]. '<br>';
Your regex is very slooooooow. After trying it on regex101.com, I found it would timeout on PHP (but not JS, for whatever reason). I'm pretty sure the timeout happens at around 50,000 steps. Actually, it makes sense now why you're not using an online PHP regex tester.
I'm not sure if this is the source of your problem, but there is a default memory limit in PHP:
memory_limit [default:] "128M"
[history:] "8M" before PHP 5.2.0, "16M" in PHP 5.2.0
If you use the multiline modifier (I assume that preg_match_all essentially adds the global modifier), you can use this regex that only takes 1282 steps to find all 4 matches:
^ [0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})
Actually, there are only 2 characters that I added. They're at the beginning, the anchor ^ and the literal space.
If you have to write a long pattern, the first thing to do is to make it readable. To do that, use the verbose mode (x modifier) that allows comments and free-spacing, and use named captures.
Then you need to make a precise description of what you are looking for:
your target takes a whole line => use the anchors ^ and $ with the modifier m, and use the \h class (that only contains horizontal white-spaces) instead of the \s class.
instead of using this kind of inefficient sub-pattern (?:(?!.....).){m,n} to describe what your field must not contain, describe what the field can contain.
use atomic groups (?>...) when needed instead of non-capturing groups to avoid useless backtracking.
in general, using precise characters classes avoids a lot of problems
pattern:
~
^ \h*+                                          # start of the line
  # named captures                              # field separators
  (?<VOTERNO>     [0-9]+ )                      \h+
  (?<SURNAME>     \S+ (?>\h\S+)*? )             \h{2,}
  (?<OTHERNAMES>  \S+ (?>\h\S+)*? )             \h{2,}
  (?<DOB>         [0-9]{2}-[0-9]{2}-[0-9]{4} )  \h+
  (?<SEX>         [FM] )                        \h+
  (?<APPID_RECNO> [0-9A-Z/]+ )                  \h+
  (?<VILLAGE>     \S+ (?>\h\S+)* )
\h* $                                           # end of the line
~mx
demo
If you want to know what goes wrong with a pattern, you can use the function preg_last_error()
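The same layout carries over to other engines. Below is a rough Python sketch of the verbose pattern above, under a few assumptions: Python's re has no \h, so [ \t] stands in for it (a narrower class); plain non-capturing groups replace the atomic groups so the sketch also runs on older Python versions; and the sample line is invented for illustration, since the real data is only in the linked paste.

```python
import re

LINE = re.compile(
    r"""
    ^ [ \t]*                                            # start of the line
    (?P<VOTERNO>     [0-9]+ )                      [ \t]+
    (?P<SURNAME>     \S+ (?:[ \t]\S+)*? )          [ \t]{2,}
    (?P<OTHERNAMES>  \S+ (?:[ \t]\S+)*? )          [ \t]{2,}
    (?P<DOB>         [0-9]{2}-[0-9]{2}-[0-9]{4} )  [ \t]+
    (?P<SEX>         [FM] )                        [ \t]+
    (?P<APPID_RECNO> [0-9A-Z/]+ )                  [ \t]+
    (?P<VILLAGE>     \S+ (?:[ \t]\S+)* )
    [ \t]* $                                            # end of the line
    """,
    re.VERBOSE | re.MULTILINE,
)


def parse_lines(text):
    """Return one dict of named fields per matching line."""
    return [m.groupdict() for m in LINE.finditer(text)]


# Hypothetical sample line, made up for this sketch.
SAMPLE = "  12345  DOE  JOHN SMITH  01-02-1980  M  AB/123  SOME VILLAGE"
print(parse_lines(SAMPLE))
```

Because every field is described by what it may contain and the separators are explicit, each line is matched in a single left-to-right pass with very little backtracking.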

Regular Expressions: Non-Greedy with Stack?

I have to do a lot of regex work within LaTeX and HTML files, and often I find myself in the following situation:
I want something like \mbox{\sqrt{2}} + \sqrt{4} to be stripped to \sqrt{2} + \sqrt{4}.
In words: "replace every occurrence of \mbox{...} by its content".
So, how do I do that?
The greedy version \mbox{(.*)} gets me \sqrt{2}} + \sqrt{4 in $1 and the
non-greedy version \mbox{(.*?)} gets me \sqrt{2 in $1.
Neither is what I want.
What I need is for the RegEx engine to somehow keep a
stack of the characters at the positions before and after (.*), namely { and }. So, when a new { is encountered in .*, it should be pushed onto the stack; when a } is encountered, the last { should be removed from the stack. When the stack is empty, .* is done.
Similar cases occur with nested HTML Tags.
So, since most regex engines create an FSA for each regex, a stack should be feasible, or am I missing something? Some rare modifier that I'm not aware of? I am wondering why there is no solution for this.
Of course I could code something myself with java/python/perl or whatever, but I'd like to have it integrated in RegEx :)
Regards, Gilbert
(ps: I omitted to project + \sqrt{4} to keep the example small, \ should be escaped too)
It depends on your regex engine but it is possible with the .Net regex engine as follows...
\\mbox{(
    (?>
        [^{}]+
      | { (?<number>)
      | } (?<-number>)
    )*
    (?(number)(?!))
)
}
Assuming you are using RegexOptions.IgnorePatternWhitespace,
you can then do regex.Replace(sourceText, "$1") to perform the conversion you wished.
Here's another regex that works in perl http://codepad.org/fcVz9Bky :
s/
    \\mbox{
    (
        (?:
            [^{}]+       #either match any number of non-braces
          |              #or
            \{[^{}]+}    #braces surrounding non-braces
        )*
    )
    }
/$1/x;
Note: It only works for one level of nesting
Another trick you may be able to use is a recursive regex (which should be supported by PCRE and a few other flavors):
\\mbox(\{([^{}]|(?1)+)*+\})
Not too much to explain, if you're in the right state of mind.
Here's a similar one, but a little more flexible (for example, easier to add [] and (), or other balanced constructs):
\\mbox\{([^{}]|\{(?1)*\})*\}
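Where the regex flavor at hand has neither balancing groups nor recursion, the stack idea from the question can also be coded by hand. The following is a minimal Python sketch, not a full LaTeX parser: since only one kind of bracket is involved, a depth counter plays the role of the stack, and escaped braces such as \{ are not treated specially. The function name strip_mbox is made up for illustration.

```python
MARKER = "\\mbox{"


def strip_mbox(text: str) -> str:
    """Replace every balanced \\mbox{...} by its content (nested ones too)."""
    out = []
    i = 0
    while True:
        start = text.find(MARKER, i)
        if start == -1:
            out.append(text[i:])
            break
        out.append(text[i:start])
        depth = 1                      # the "{" of \mbox{ is already open
        j = start + len(MARKER)
        while j < len(text) and depth > 0:
            if text[j] == "{":
                depth += 1             # push
            elif text[j] == "}":
                depth -= 1             # pop
            j += 1
        if depth != 0:
            # Unbalanced braces: leave the remainder untouched.
            out.append(text[start:])
            break
        # text[j - 1] is the closing brace that belongs to \mbox{.
        out.append(strip_mbox(text[start + len(MARKER):j - 1]))
        i = j
    return "".join(out)


print(strip_mbox(r"\mbox{\sqrt{2}} + \sqrt{4}"))
# \sqrt{2} + \sqrt{4}
```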

extract fileName using Regex

If I want to match only fileName, i.e,
in C://Directory/FileName.cs, somehow ignore everything before FileName.cs using Regex.
How can I do it?
I need this for a Compiled UI I am working on ... can't use programming language as it only accepts Regex.
Any ideas?
Something like this might work:
[^/]*$
It matches all characters to the end of the line that are not "/"..
If you want to match paths that use the "\" path separator you would change the regex to:
[^\]*$
But do make sure to escape the "\" character if your programming language or environment requires it. For instance you might have to write something like this:
[^\\]*$
EDIT
I removed the leading "/" and trailing "/" as they may be confusing since they are not really part of the regEx, but they are a very common way of representing a regular expression.
And of course, depending on the features that the regEx engine supports you may be able to use look-ahead/look-behind and capturing to craft a better regEx.
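To see the idea in action outside the UI in question, here is a small Python sketch that uses a single character class covering both separators. The function name file_name is made up for illustration.

```python
import re

# Everything after the last "/" or "\" up to the end of the string.
LAST_SEGMENT = re.compile(r"[^\\/]*$")


def file_name(path: str) -> str:
    """Return the last path segment (empty if the path ends in a separator)."""
    return LAST_SEGMENT.search(path).group()


print(file_name("C://Directory/FileName.cs"))    # FileName.cs
print(file_name("C:\\Directory\\FileName.cs"))   # FileName.cs
```

The search always succeeds, because the pattern can match the empty string at the very end of the input.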
What language are you using? Why are you not using the standard path mechanisms of that language?
How about http://msdn.microsoft.com/en-us/library/system.io.path.aspx ?
Based on your comment of needing to exclude paths that do not match 'abc', try this:
^.+/(?:(?!abc)[^/])+$
Completely split out in regex comment mode, that is:
(?x)        # flag to enable comments
^           # start of line
.+          # match any character (except newline)
            # greedily one or more times
/           # a literal slash character
(?:         # begin non-capturing group
  (?!       # begin negative lookahead
            # (contents must not appear after the current position)
    abc     # literal text abc
  )         # end negative lookahead
  [^/]      # any character that is not a slash
)           # end non-capturing group
+           # repeat the above nc group one or more times
            # (essentially, we keep looking for non-slashes that do not start 'abc')
$           # end of line
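A quick way to sanity-check the lookahead version is to run it against a few paths. This Python sketch uses made-up sample paths; only the pattern itself comes from the answer above.

```python
import re

# Last path segment must not contain "abc" anywhere.
NO_ABC_FILE = re.compile(r"^.+/(?:(?!abc)[^/])+$")

for path in (
    "C://Directory/FileName.cs",   # file name without abc
    "C://Directory/xabcx.cs",      # abc inside the file name
    "C://abc/FileName.cs",         # abc only in a directory name
):
    print(path, bool(NO_ABC_FILE.match(path)))
```

Only the second path is rejected: the lookahead is applied at every character of the last segment, while directory names are consumed by the leading .+/ and are not checked.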
The regex expression that did it for me was
[^\/]*$
I'm way late to the party and I'm also ignoring the requirement of regex because, as J-16 SDiZ pointed out, sometimes there is a better solution. Even though the question is 4 years old, people looking for a simple solution deserve choices.
Try using the following:
public string ConvertFileName(string filename)
{
string[] temparray = filename.Split('\\');
filename = temparray[temparray.Length - 1];
return filename;
}
This method splits the string on the "\" character, stores the resulting strings in an array and returns the last element of the array (the filename).
Though the OP seems to be writing for UNIX, it doesn't take much to figure out how to tailor it to your particular need.
Some interpret "filename" as the basename. This example extracts the filename/basename, including for files that have no extension for some reason. It can also get the last directory in the same fashion.
You can see how it works and test it here.
https://regexr.com/4ht5v
The regexp is:
.+?\\(?=\w+)|\.\w+$|\\$
Before:
C:\Directory\BaseFileName.ext
C:\Directory\BaseFileName
C:\This is a Directory\Last Directory With trailing backslash\
C:\This is a Directory\Last Directory Without trailing backslash
After:
BaseFileName
BaseFileName
Last Directory With trailing backslash
Last Directory Without trailing backslash
For the sake of completion, this is how it would work with JavaScript should anyone require it.
// Example of getting a BaseFileName from a path
var path = "C:\\Directory\\FileName.cs";
var result = path.replace(/.+?\\(?=\w+)|\.\w+$|\\$/gm,"");
console.log(result);
Try this (working with / and \):
[^\/|\\]*$
I would use: .*/(.*$)
The parentheses mark a group, which is the file name.
The regular expression you use may vary depending on the regex syntax (PCRE, POSIX).
I suggest you use a regex tool; there are several for Windows and Linux:
Windows - http://sourceforge.net/projects/regexcreator/
Windows - http://weitz.de/regex-coach/
Linux - kodos
Hope it helps
just a variation on miky's that works for both filesystem path characters:
[^\\/]*\s
Suppose the file name has special characters, especially when supporting Mac, where special characters are allowed in file names. Server-side, Path.GetFileName(fileName) fails and throws an error because of illegal characters in the path. The following code, using a regex, comes to the rescue.
The following regex takes care of two things:
In IE, when a file is uploaded, the file path contains the folders as well (i.e. c:\samplefolder\subfolder\sample.xls). The expression below replaces all folders with an empty string and retains the file name.
On a Mac, the file name is the only thing supplied, as the browser is Safari, and it allows special characters in the file name.
var regExpDir = @"(^[\w]:\\)([\w].+\w\\)";
fileName = Regex.Replace(fileName, regExpDir, string.Empty);
I did this without RegEx in Powershell:
Put the link in a variable
$Link = "http://some.url/some/path/file.name"
Split the link on the "/" character
$split = $Link.Split("/")
Count the splits
$SplitCount = $Split.Count
Target the filename
$Split[$SplitCount -1]
Full code :
$Link = "http://some.url/some/path/file.name"
$Split = $Link.Split("/")
$SplitCount = $Split.Count
$Split[$SplitCount -1]
A rather elegant solution with lookahead and lookbehind wasn't mentioned:
(?<=.+)(?=.cs)