Regex matching a whole line with a few criteria

I want to match the whole line. The purpose: in Python, I will list all the files in a directory, then pick the file URLs that contain certain keywords, i.e. 'qwert2asdf' and 'windows'.
My current regex:
[a-zA-Z0-9_.\-\\]*(qwert2asdf)[a-zA-Z0-9_.\-\\]*(windows)[a-zA-Z0-9_.\-\\]*
matches line #4 which is what I need
4\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
Question 1: Is there a better way that avoids repeating [a-zA-Z0-9_.\-\\]* three times?
Question 2: How do I make the match ignore the order of 'qwert2asdf' and 'windows', so that it still matches if 'windows' appears before 'qwert2asdf'?
1\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_boxt_pkg_isys.abcdefg_urururur_20140701_1815.linux.tar.gz
2\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\bbb_pkg_all_systems.abcdefg_urururur_20140701_1815.tar.gz
3\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
4\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
5\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
6\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
7\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
8\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
9\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_system.abcdefg_urururur_20140701_1815.tar.gz
10\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\doc_pkg_evih_iii_ass_system.abcdefg_urururur_20140701_1815.tar.gz
11\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815.linux.tar.gz
12\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815.windows.tar.gz
13\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815_vp.tar.gz
14\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_pkg_css_skm_cgdsg0_system.abcdefg_urururur_20140701_1815.tar.gz
15\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_pkg_css_skm_asdfgt_system.abcdefg_urururur_20140701_1815.tar.gz
16\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
17\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
18\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
19\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.linux.tar.gz
20\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.windows.tar.gz
21\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\doc_pkg_ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.tar.gz

You can use Positive Lookahead here.
^(?=.*qwert2asdf)(?=.*windows)[\w\\.-]*$
Explanation:
^ # the beginning of the string
(?= # look ahead to see if there is:
.* # any character except \n (0 or more times)
qwert2asdf # 'qwert2asdf'
) # end of look-ahead
(?= # look ahead to see if there is:
.* # any character except \n (0 or more times)
windows # 'windows'
) # end of look-ahead
[\w\\.-]* # any character of: word characters (a-z, A-Z, 0-9, _),
# '\\', '.', '-' (0 or more times)
$ # before an optional \n, and the end of the string
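In Python, the lookahead pattern above could be applied to a list of paths roughly like this. This is a minimal sketch: the sample paths are shortened, made-up stand-ins for the real share paths, and select_paths is a hypothetical helper name.

```python
import re

# Both lookaheads start at the beginning of the line, so the order of the
# two keywords does not matter.
PATTERN = re.compile(r'^(?=.*qwert2asdf)(?=.*windows)[\w\\.-]*$')

def select_paths(paths):
    """Return the paths that contain both keywords (hypothetical helper)."""
    return [p for p in paths if PATTERN.match(p)]

# Shortened, made-up sample paths (assumption, not the real share).
sample = [
    r'\\host\share\sw_qwert2asdf_system.linux.tar.gz',
    r'\\host\share\sw_qwert2asdf_system.windows.tar.gz',
    r'\\host\share\windows_sw_qwert2asdf.tar.gz',
    r'\\host\share\ia_css.windows.tar.gz',
]
selected = select_paths(sample)
for path in selected:
    print(path)
```

Only the second and third sample paths contain both keywords, in either order, so only those two are selected.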

This should work:
(.*?)windows(.*)

You did not say which regex flavor you are using (POSIX, Perl, Java, ...), but I am unaware of any that has a way to write a pattern that matches the same set of strings as yours without repeating the character class as yours does.
You might be tempted to look at back references, but they do not do what you want.
Depending on the host language, however, you might be able to reduce duplication by putting the text of the character class into a variable, and interpolating the variable into your regular expression at each of the three points.
Matching regardless of the order of the 'qwert2asdf' and 'windows' substrings is messier, but it can be done. Here's one way that should work in pretty much any regex engine, modulo any metacharacter (non-)escaping that might need to be performed:
[a-zA-Z0-9_.\-\\]*((qwert2asdf)[a-zA-Z0-9_.\-\\]*(windows)|(windows)[a-zA-Z0-9_.\-\\]*(qwert2asdf))[a-zA-Z0-9_.\-\\]*
A regex engine that supports zero-width lookbehind assertions would provide other alternatives, but I don't think any would come out shorter.
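Since the question mentions Python, here is one way the variable-interpolation idea and the two-order alternation might look there. It is only a sketch: CHARS and build_pattern are names invented for this example, and the file names are shortened, made-up samples.

```python
import re
from itertools import permutations

# The repeated character class lives in one place (hypothetical name).
CHARS = r'[a-zA-Z0-9_.\-\\]*'

def build_pattern(keywords):
    """Build CHARS (k1 CHARS k2 | k2 CHARS k1) CHARS for any keyword order."""
    orders = (
        CHARS.join(re.escape(k) for k in order)
        for order in permutations(keywords)
    )
    return re.compile(CHARS + '(?:' + '|'.join(orders) + ')' + CHARS)

pattern = build_pattern(['qwert2asdf', 'windows'])

# Made-up sample names (assumption).
names = [
    'sw_qwert2asdf_system.windows.tar.gz',
    'windows_sw_qwert2asdf.tar.gz',
    'sw_qwert2asdf_system.linux.tar.gz',
    'sw_system.windows.tar.gz',
]
selected = [n for n in names if pattern.fullmatch(n)]
print(selected)
```

The number of alternatives grows factorially with the number of keywords, so this construction is only practical for a handful of them.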

Related

Improve regex for capturing files in a directory, excluding dotfiles

I am looking to get all non dot-files in a folder with a particular extension. So far my regex is:
(?<=\/|^)(?<!\.)(\w+(?:\.mov|\.py|))$
Is there a way to improve the above regex? What might be some examples where this regex might not work?
The \w+ will only match one or more letters, digits or _. It will not match the rest of the chars that may constitute a valid file name. Also, your (?<!\.) lookbehind is redundant because the previous lookbehind already excludes a dot at that position.
Besides, you do not have to repeat the dot pattern; you may group only the extensions.
You may use
(?<=\/|^)([^\/]+)(\.(?:mov|py))$
See this regex demo
(?<=\/|^) - / or start of string allowed immediately on the left
([^\/]+) - Group 1: any one or more chars other than /
(\.(?:mov|py)) - Group 2: a . char and then either mov or py
$ - end of string
Note you may also replace (?<=\/|^) with (?<![^\/]) in real code since it will work the same with standalone strings. It will mess up the demo results at regex101.com because there, you test against a single multiline string (that is why I added \n to the negated character class there, too).
Here's how I would do it:
(?<=\/|^)[^\/\\:*?"<>|\n]+\.(?:mov|py)$
(?<=\/|^) Lookbehind just like you had it
[^\/\\:*?"<>|\n]+ One or more of any character that is not disallowed in filenames
\. A literal dot
(?:mov|py) Either "mov" or "py" in a non-capturing group (similar to yours, but I moved the dot out and excluded the redundant "|")
$ Anchors the search to the end of the line, so only files will match, no folders
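For readers trying this in Python: the built-in re module requires fixed-width lookbehind, so the sketch below uses the (?<![^/]) form mentioned earlier in place of (?<=\/|^). The extra (?!\.) lookahead, which skips names starting with a dot, is an addition made for this example and is not part of either answer; media_or_script_name is a hypothetical helper name.

```python
import re

# (?<![^/])  : not preceded by a non-slash char (start of string or after '/')
# (?!\.)     : assumption added here, rejects dotfiles such as ".hidden.py"
# [^/\\:*?"<>|\n]+ : one or more chars that are allowed in a file name
FILE_RE = re.compile(r'(?<![^/])(?!\.)[^/\\:*?"<>|\n]+\.(?:mov|py)$')

def media_or_script_name(path):
    """Return the file name if it ends in .mov or .py, else None."""
    m = FILE_RE.search(path)
    return m.group(0) if m else None

for candidate in ['clips/intro.mov', 'script.py', '.hidden.py',
                  'dir/.secret.mov', 'notes.txt', 'archive.py/readme']:
    print(candidate, '->', media_or_script_name(candidate))
```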

Which would be better non-greedy regex or negated character class?

I need to match #anything_here# from a string #anything_here#dhhhd#shdjhjs#. So I used the following regex.
^#.*?#
or
^#[^#]*#
Both ways work, but I would like to know which one is the better solution: a regex with non-greedy repetition or a regex with a negated character class?
Negated character classes should usually be preferred over lazy matching, if possible.
If the regex is successful, ^#[^#]*# can match the content between #s in a single step, while ^#.*?# needs to expand for each character between #s.
When failing (for the case of no ending #) most regex engines will apply a little magic and internally treat [^#]* as [^#]*+, as there is a clear cut border between # and non-#, thus it will match to the end of the string, recognize the missing # and not backtrack, but instantly fail. .*? will expand character for character as usual.
When used in larger contexts, [^#]* will also never expand over the borders of the ending # while this is very well possible for the lazy matching. E.g. ^#[^#]*a[^#]*# won't match #bbbb#a# while ^#.*?a.*?# will.
Note that [^#] will also match newlines, while . doesn't (in most regex engines and unless used in singleline mode). You can avoid this by adding the newline character to the negation - if it is not wanted.
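The behavioural differences described above can be checked with a short Python sketch. It shows only which strings match, not the engine's step counts, and the '#a\nb#' test string is made up for illustration.

```python
import re

text = '#anything_here#dhhhd#shdjhjs#'

# On the question's input both patterns return the same first field.
lazy = re.match(r'^#.*?#', text).group()
negated = re.match(r'^#[^#]*#', text).group()

# In a larger pattern the lazy dot may expand across a '#' to make the
# overall match succeed; the negated class never can.
crosses = re.match(r'^#.*?a.*?#', '#bbbb#a#')
stays_inside = re.match(r'^#[^#]*a[^#]*#', '#bbbb#a#')

# Unlike '.', the negated class also matches a newline unless it is excluded.
with_newline = re.match(r'^#[^#]*#', '#a\nb#')
without_newline = re.match(r'^#[^#\n]*#', '#a\nb#')

print(lazy, negated)
print(crosses.group(), stays_inside)
print(with_newline is not None, without_newline is not None)
```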
It is clear the ^#[^#]*# option is much better.
The negated character class is quantified greedily which means the regex engine grabs 0 or more chars other than # right away, as many as possible.
When you use a lazy dot matching pattern, the engine matches #, then tries to match the trailing # (skipping the .*?). It does not find the # at Index 1, so the .*? matches the a char. This .*? pattern expands as many times as there are chars other than # up to the first #.

How to match periods not at the end of paragraphs?

If I want to find all periods that ARE at the end of paragraphs, I could do \.($|\n). But how can I negate that and say "a period followed by any character that ISN'T one of these", given that metacharacters don't work inside character classes, which stops me from using negated character classes?
What's in a $? It depends!
The answer very much depends on which language and regex engine you're using. You see,
In Java, the $ asserts that we are positioned at the end of the string or before any carriage return or newline at the end of the string. So you'd be safe with a \.(?!$)
In PCRE, C# and Python, the $ asserts that we are positioned at the end of the string or before any newline at the end of the string. So you could use a \.(?!$|\r)
In JavaScript and Ruby, the $ asserts that we are positioned at the end of the string. So you'd need to go the full Monty with a \.(?!$|[\r\n]).
Therefore, for a multi-engine solution, the safest would be:
\.(?!$|[\r\n])
But in the right context, the other two options are perfectly acceptable.
Explanation
\. matches the literal period
The negative lookahead (?!$|[\r\n]) asserts that what follows is neither the "end of the string" nor a carriage return nor a newline.
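As a quick check of the multi-engine pattern in Python, the following sketch lists the positions of periods that are not at the end of a line. The sample text is made up for illustration.

```python
import re

# Python's $ (without re.MULTILINE) matches at the very end of the string
# and just before a final newline, so \r and \n are listed explicitly.
NOT_AT_LINE_END = re.compile(r'\.(?!$|[\r\n])')

text = 'First. Second.\nThird. Fourth.'  # made-up sample
positions = [m.start() for m in NOT_AT_LINE_END.finditer(text)]
print(positions)
```

Only the periods after "First" and "Third" are reported; the other two are followed by a newline or by the end of the string.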
Use a Negative Lookahead to do this.
\.(?!\n|$)
Explanation:
\. '.'
(?! look ahead to see if there is not:
\n '\n' (newline)
| OR
$ before an optional \n, and the end of the string
) end of look-ahead
The most useful longhand version of the negatively looked ahead EOL check after the period winds up making your entire pattern something like this:
(?x: # enable comments
\. # a literal dot character
(?! # look ahead for not the following{
\R ? # optional EOL grapheme cluster
\z # at the true end of string
) # } end look ahead
)
That assumes you don’t want it to match “interstitially” (that is, before any line-terminator grapheme), which would be the simpler:
(?=\R)
Some argument can be made for that \R? being made into a \R* instead, in case you should happen to have multiple line-terminators at the end of a record, like several newlines in a row. That way 0, 1, 2, or however many EOL graphemes are allowed before the end of the string.
On the other hand, it may well be the case that a paragraph must be at least two EOL graphemes, not just one alone. For example, this is true in markup here and in other files with “blank-line separated” types of paragraphs. So no EOLs are ok, and two or more are too, but not just one of them.
For such text, you would need \R{2,}, but the whole bit would be optionalized, yielding in that case:
(?x: # enable comments
\. # a literal dot character
(?! # look ahead for NOT the following {
(?:
\R {2,} # two or more EOL grapheme clusters
) ? # optionally
\z # at the true end of string
) # } end negated look ahead
)
If you don’t have \R from UTS 18: Unicode Regular Expressions — Line Boundaries in your regex flavor, then you will have to write it out the hard way, which is the rather annoying:
(?x: # We are emulating \R per UTS#18
(?> # Prohibit backtrack within subpattern
\r \n # Match a CRLF without backtracking
# or else any code point with the
# vertical space character property
# \p{VertSpace}, here enumerated in full
| [\x0A-\x0D\x85\x{2028}\x{2029}]
)
)
You need the no-backtracking bit to avoid something like \R{2} being allowed to match a single CRLF, and it isn’t allowed to do that.
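Python's re module has no \R, and atomic groups are only available in recent versions, so a rough emulation there needs a different trick. The sketch below is an assumption-laden variant, not the pattern above: it takes \r out of the character class and allows a lone \r only when no \n follows, which keeps a single CRLF from being counted as two line terminators. EOL and PERIOD_NOT_AT_END are names invented for this example, and \Z plays the role of \z.

```python
import re

# Emulated \R without an atomic group (hypothetical name EOL):
#   \r\n            a CRLF pair, counted once
#   \r(?!\n)        a lone CR, never the first half of a CRLF
#   [\n\x0b\x0c\x85\u2028\u2029]  the remaining vertical-space code points
EOL = r'(?:\r\n|\r(?!\n)|[\n\x0b\x0c\x85\u2028\u2029])'

# A period that is NOT followed by an optional EOL and the true end of string.
PERIOD_NOT_AT_END = re.compile(r'\.(?!' + EOL + r'?\Z)')

single_crlf_as_two = re.fullmatch(EOL + '{2}', '\r\n')
two_terminators = re.fullmatch(EOL + '{2}', '\r\n\n')
positions = [m.start() for m in PERIOD_NOT_AT_END.finditer('a.b.\r\n')]

print(single_crlf_as_two, two_terminators is not None, positions)
```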
One final thing to consider is whether you want to allow for optional horizontal whitespace to intervene between the period and the EOL. I rather imagine that you do, but without a tighter formal specification in the OP, it’s impossible to say so definitely.
You should use a negative lookahead.
\.(?!$|\n)
More on this: http://www.regular-expressions.info/lookaround.html

Nested regex lookahead and lookbehind

I am having problems with the nested '+'/'-' lookahead/lookbehind in regex.
Let's say that I want to change the '*' in a string with '%' and let's say that '\' escapes the next character. (Turning a regex to sql like command ^^).
So the string
'*test*' should be changed to '%test%',
'\\*test\\*' -> '\\%test\\%', but
'\*test\*' and '\\\*test\\\*' should stay the same.
I tried:
(?<!\\)(?=\\\\)*\* but this doesn't work
(?<!\\)((?=\\\\)*\*) ...
(?<!\\(?=\\\\)*)\* ...
(?=(?<!\\)(?=\\\\)*)\* ...
What is the correct regex that will match the '*'s in examples given above?
What is the difference between (?<!\\(?=\\\\)*)\* and (?=(?<!\\)(?=\\\\)*)\* or if these are essentially wrong the difference between regex that have such a visual construction?
To find an unescaped character, you would look for a character that is preceded by an even number of (or zero) escape characters. This is relatively straight-forward.
(?<=(?<!\\)(?:\\\\)*)\* # this is explained in Tim Pietzcker's answer
Unfortunately, many regex engines do not support variable-length look-behind, so we have to substitute with look-ahead:
(?=(?<!\\)(?:\\\\)*\*)(\\*)\* # also look at ridgerunner's improved version
Replace this with the contents of group 1 and a % sign.
Explanation
(?= # start look-ahead
(?<!\\) # a position not preceded by a backslash (via look-behind)
(?:\\\\)* # an even number of backslashes (don't capture them)
\* # a star
) # end look-ahead. If found,
( # start group 1
\\* # match any number of backslashes in front of the star
) # end group 1
\* # match the star itself
The look-ahead makes sure only even numbers of backslashes are taken into account. Anyway, there is no way around matching them into a group, since the look-ahead does not advance the position in the string.
Ok, since Tim decided to not update his regex with my suggested mods (and Tomalak's answer is not as streamlined), here is my recommended solution:
Replace: ((?<!\\)(?:\\\\)*)\* with $1%
Here it is in the form of a commented PHP snippet:
// Replace all non-escaped asterisks with "%".
$re = '% # Match non-escaped asterisks.
( # $1: Any/all preceding escaped backslashes.
(?<!\\\\) # At a position not preceded by a backslash,
(?:\\\\\\\\)* # Match zero or more escaped backslashes.
) # End $1: Any preceding escaped backslashes.
\* # Unescaped literal asterisk.
%x';
$text = preg_replace($re, '$1%', $text);
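The same replacement appears to carry over to Python unchanged, because the only lookbehind involved is a single fixed-width character. A minimal sketch (the function name star_to_percent is made up for this example), checked against the strings from the question:

```python
import re

# Group 1: zero or more escaped backslashes (pairs) that start at a position
# not preceded by a backslash; then the unescaped asterisk itself.
UNESCAPED_STAR = re.compile(r'((?<!\\)(?:\\\\)*)\*')

def star_to_percent(text):
    """Replace every non-escaped '*' with '%', keeping preceding backslashes."""
    return UNESCAPED_STAR.sub(r'\1%', text)

for s in [r'*test*', r'\\*test\\*', r'\*test\*', r'\\\*test\\\*', r'**text**']:
    print(s, '->', star_to_percent(s))
```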
Addendum: Non-lookaround JavaScript Solution
The above solution does require lookbehind, so it will not work in JavaScript. The following JavaScript solution does not use lookbehind:
text = text.replace(/(\\[\S\s])|\*/g,
function(m0, m1) {
return m1 ? m1 : '%';
});
This solution replaces each instance of backslash-anything with itself, and each instance of * asterisk with a % percent sign.
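The callback idea translates directly to Python, where re.sub accepts a function as the replacement. This is a sketch of that translation rather than code from the answer; replace_unescaped_stars is a hypothetical name.

```python
import re

# Alternative 1 (group 1): a backslash plus whatever character it escapes.
# Alternative 2: a bare asterisk, which can only be reached when unescaped.
TOKEN = re.compile(r'(\\[\s\S])|\*')

def replace_unescaped_stars(text):
    # Keep escaped pairs as they are; turn a bare '*' into '%'.
    return TOKEN.sub(lambda m: m.group(1) or '%', text)

for s in [r'*test*', r'\\*test\\*', r'\*test\*', r'**text**']:
    print(s, '->', replace_unescaped_stars(s))
```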
Edit 2011-10-24: Fixed Javascript version to correctly handle cases such as: **text**. (Thanks to Alan Moore for pointing out the error in previous version.)
Others have shown how this can be done with a lookbehind, but I'd like to make a case for not using lookarounds at all. Consider this solution:
s/\G([^*\\]*(?:\\.[^*\\]*)*)\*/$1%/g;
The bulk of the regex, [^*\\]*(?:\\.[^*\\]*)*, is an example of Friedl's "unrolled loop" idiom. It consumes as many as it can of individual characters other than asterisk or backslash, or pairs of characters consisting of a backslash followed by anything. That allows it to avoid consuming unescaped asterisks, no matter how many escaped backslashes (or other characters) precede them.
The \G anchors each match to the position where the previous match ended, or to the beginning of the input if this is the first match attempt. This prevents the regex engine from simply skipping over escaped backslashes and matching the unescaped asterisks anyway. So, each iteration of the /g controlled match consumes everything up to the next unescaped asterisk, capturing all but the asterisk in group #1. Then that's plugged back in and the * is replaced with %.
I think this is at least as readable as the lookaround approaches, and easier to understand. It does require support for \G, so it won't work in JavaScript or Python, but it works just fine in Perl.
So you essentially want to match * only if it's preceded by an even number of backslashes (or, in other words, if it isn't escaped)? Then you don't need lookahead at all since you're only looking back, aren't you?
Search for
(?<=(?<!\\)(?:\\\\)*)\*
and replace with %.
Explanation:
(?<= # Assert that it's possible to match before the current position...
(?<!\\) # (unless there are more backslashes before that)
(?:\\\\)* # an even number of backslashes
) # End of lookbehind
\* # Then match an asterisk
The problem of detecting escaped backslashes in regex has fascinated me for a while, and it wasn't until recently that I realized I was completely overcomplicating it. There are a couple of things that make it simpler, and as far as I can tell nobody here has noticed them yet:
Backslashes escape any character after them, not just other backslashes. So (\\.)* will eat an entire chain of escaped characters, whether they're backslashes or not. You don't have to worry about even- or odd-numbered slashes; just check for a solitary \ at the beginning or end of the chain (ridgerunner's JavaScript solution does take advantage of this).
Lookarounds aren't the only way to make sure you start with the first backslash in a chain. You can just look for a non-backslash character (or the start of the string).
The result is a short, simple pattern that doesn't need lookarounds or callbacks, and it's shorter than anything else I see so far.
/(?<!\\)((?:\\.)*)\*/g
And the replacement string:
"$1%"
This works in .NET, which allows lookbehinds, and it should work for you in Perl. It's possible to do it in JavaScript, but without lookbehinds or the \G anchor, I can't see a way to do it in a one-liner. Ridgerunner's callback should work, as will a loop:
var regx = /(^|[^\\])((?:\\.)*)\*/g;
while (input.match(regx)) {
input = input.replace(regx, '$1$2%');
}
There are a lot of names here I recognize from other regex questions, and I know some of you are smarter than me. If I've made a mistake, please say so.

Regex that says what NOT to match?

I’m wondering how to match any characters except for a particular string (call it "for") in a regex.
I was thinking maybe it was something like this: [^for]* — except that that doesn’t work.
I’m sure this is a dup.
One way is to start your pattern with a lookahead like this:
(?=\A(?s:(?!for).)*\z)
That can be written like this in any regex system worth bothering with:
(?x) # turn /x mode on for commentary & spacing
(?= # lookahead assertion; hence nonconsumptive
\A # beginning of string
(?s: # begin non-capturing group for later quantification
# enable /s mode so dot can cross lines
(?! for ) # lookahead negation: ain't no "for" here
. # but there is any one single code point
) # end of "for"-negated anything-dot
* # repeat that group zero or more times, greedily
\z # until we reach the very end of the string
) # end of lookahead
Now just put that in the front of your pattern, and add whatever else you’d like afterwards. That’s how you express the logic !/for/ && ⋯ when you have to build such knowledge into the pattern.
It is similar to how you construct /foo/ && /bar/ && /glarch/ when you have to put it in a single pattern, which is
(?=\A(?s:.)*foo)(?=\A(?s:.)*bar)(?=\A(?s:.)*glarch)
^(?!for$).*$
matches any string except for.
^(?!.*for).*$
matches any string that doesn't contain for.
^(?!.*\bfor\b).*$
matches any string that doesn't contain for as a complete word, but allows words like forceps.
You can try to check whether the string matches for, and negate the result, in whatever language you use (e.g. if (not $_ =~ m/for/) in Perl)
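The three lookahead patterns and the "match, then negate" alternative can be compared side by side in Python. The sketch below uses made-up test strings, and the helper names matches and lacks_for are invented for this example.

```python
import re

NOT_EXACTLY_FOR = r'^(?!for$).*$'       # any string except "for"
NO_FOR_ANYWHERE = r'^(?!.*for).*$'      # no "for" substring at all
NO_FOR_WORD = r'^(?!.*\bfor\b).*$'      # no "for" as a complete word

def matches(pattern, s):
    """True if the pattern matches s from the start."""
    return re.match(pattern, s) is not None

def lacks_for(s):
    """Match-and-negate alternative: search for 'for' and invert the result."""
    return re.search(r'for', s) is None

for s in ['for', 'forceps', 'before', 'wait for it', 'abc']:
    print(s, matches(NOT_EXACTLY_FOR, s), matches(NO_FOR_ANYWHERE, s),
          matches(NO_FOR_WORD, s), lacks_for(s))
```

When the surrounding code can simply negate a result, lacks_for is usually the easiest to read; the lookahead forms matter when the condition has to live inside a single pattern.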