Optional regular expression operator in PowerShell - regex

In $string, I'm trying to remove the first "-1" so the output of the string will be "test test test-Long.xml".
$string = 'test test test-1-Long.xml'
$string -replace '^(.*)-?\d?(-?.*)\.xml$', '$1$2'
My issue is that I need to make that same first "-1" pattern optional, since the hyphen and the number may both be absent.
Why is the "?" operator not working? I've also tried {0,1} after each as well with no luck.

Quantifiers are greedy by default, so the first (.*) swallows the "-1" before the optional parts get a chance to match it.
I am not sure it's the best solution, but I could make it work this way:
$string -replace '^([^\-]*)-?\d?(-?.*)\.xml$', '$1$2'
Sole change: the first group must not contain the dash. That kind of "balances" the regex, avoiding the greediness, and yields:
test test test-Long
Note: the output is not test test test-Long.xml as required in your question. To get that, simply drop the \.xml$ suffix from the pattern:
$string -replace '^([^\-]*)-?\d?(-?.*)', '$1$2'

The $string -replace '^(.*?)(?:-\d+)?(-.*?)\.xml$', '$1$2' should work if the hyphen is obligatory in the input. Or $string -replace '^((?:(?!-\d+).)*)(?:-\d+)?(.*)\.xml$', '$1$2' in case the input may have no hyphen.
See the regex demo 1 and regex demo 2.
Pattern details:
^ - start of string
(.*?) - Group 1 capturing any 0+ characters other than a newline, as few as possible (as the *? quantifier is lazy), up to the first occurrence of the subsequent subpatterns (NOTE: to increase regex performance, you may use a tempered greedy token based pattern instead of (.*?) - ((?:(?!-\d+).)*) - that matches any text other than a hyphen followed by 1 or more digits, thus acting similarly to a negated character class, but for a sequence of symbols)
(?:-\d+)? - non-capturing group with a greedy ? quantifier (so this group has more priority for the regex engine: the previous capture will end before this pattern) matching a hyphen followed by one or more digits
(-.*?) - Group 2 capturing an obligatory - and any 0+ chars other than a newline, as few as possible, up to
\.xml - literal text .xml
$ - end of string.
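As a quick check of the two patterns described above, here is a minimal sketch in Python rather than PowerShell (an assumption made only so the snippet is easy to run; Python's re.sub uses \1\2 where -replace uses $1$2). The names WITH_HYPHEN, MAYBE_HYPHEN and strip_counter are hypothetical.

```python
import re

# Pattern for inputs where the hyphen before "Long" is always present.
WITH_HYPHEN = re.compile(r'^(.*?)(?:-\d+)?(-.*?)\.xml$')
# Tempered-greedy-token variant for inputs that may contain no hyphen at all.
MAYBE_HYPHEN = re.compile(r'^((?:(?!-\d+).)*)(?:-\d+)?(.*)\.xml$')


def strip_counter(name, pattern=WITH_HYPHEN):
    """Drop the optional "-<digits>" part.

    Like the replacement in the answer, this keeps only groups 1 and 2,
    so the ".xml" suffix is dropped as well.
    """
    return pattern.sub(r'\1\2', name)
```

Both patterns leave the name intact when the "-1" part is missing, which is the behaviour the question asks for.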
Why is the "?" operator not working?
It is not true. The quantifier ? works well as it matches one or zero occurrences of the quantified subpattern. However, the issue arises in combination with the first .* greedy dot matching subpattern. See your regex in action: the first capture group grabs the whole substring up to the last .xml, and the second group is empty. Why?
Because of backtracking and how greedy quantifier works. The .* matches any characters, but a newline, as many as possible. Thus, it grabs the whole string up to the end. Then, backtracking starts: one character at a time is given back and tested against the subsequent subpatterns.
What are they? -?\d?(-?.*) - all of them can match an empty string. The -? matches an empty string before .xml, ok, \d? matches there as well, and -? and .* also match there.
However, the .* grabs the whole string again, but there is the \.xml pattern to accommodate. So, the second capture group is just empty. In fact, there are more steps the regex engine performs (see the regex debugger page), but the main idea is like that.
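The effect of the greedy first group can be observed directly by inspecting the captures. This is a small sketch using Python's re module (an assumption; the question itself uses PowerShell, but the backtracking behaviour described here is the same).

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r'^(.*)-?\d?(-?.*)\.xml$')

match = ORIGINAL.match('test test test-1-Long.xml')
# Group 1 has taken everything before ".xml"; group 2 is left empty.
groups = match.groups()
print(groups)  # ('test test test-1-Long', '')
```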

Related

RegEx matching only within a match / restrict matching to part of string

Is there a way to use a single regular expression to match only within another match? For example, if I want to remove spaces from a string, but only within parentheses:
source : "foobar baz blah (some sample text in here) and some more"
desired: "foobar baz blah (somesampletextinhere) and some more"
In other words, is it possible to restrict matching to a specific part of the string?
In PCRE a combination of \G and \K can be used:
(?:\G(?!^)|\()[^)\s]*\K\s+
\G continues where the previous match ended
\K resets beginning of the reported match
[^)\s] matches any character not in the set
See demo at regex101
The idea is to chain matches to an opening parenthesis. The chain-links are either [^)\s]* or \s+. To only get spaces, \K is used to reset before them. This solution does not require a closing ).
In other regex flavors that support \G but not \K, capturing groups can help out. Eg Search for
(\G(?!^)|\()([^)\s]*)\s+
and replace with captures of the 2 groups (depending on lang: $1$2 or \1\2) - Regex101 demo
Further there is (*SKIP)(*F), a PCRE feature for skipping over certain parts. It is often used together with The Trick. The idea is simple: skip this(*SKIP)(*F)|match that - Regex101 demo. This can also be worked around with capture groups. Eg replace ([^)(]*\(|\)[^)(]*)|\s with $1
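Python's built-in re module supports neither \G, \K nor (*SKIP)(*F), so the capture-group workaround is the variant that carries over. A minimal sketch, assuming Python 3.5+ (where an unmatched group in the replacement expands to an empty string) and input with balanced, non-nested parentheses; the names are hypothetical.

```python
import re

# Group 1 consumes the text outside the parentheses (up to "(" or from ")" on).
# Whatever whitespace is left for the bare \s alternative sits inside them.
OUTSIDE_OR_SPACE = re.compile(r'([^)(]*\(|\)[^)(]*)|\s')


def squeeze_inside_parens(text):
    # When \s matched, group 1 did not take part and expands to '' (Python 3.5+),
    # which deletes the whitespace; otherwise group 1 is put back unchanged.
    return OUTSIDE_OR_SPACE.sub(r'\1', text)
```

Note that on text without any parentheses this pattern would strip every space, so it only fits input shaped like the example in the question.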
One idea is to replace any space between parentheses using a lookahead pattern:
 (?=([^\s\(]+ )*\S*\))(?!\S*\s*\()
The lookahead will attempt to match the last space before the closed parenthesis (\S*\)) and any optional space before ([^\s\(]+ )* (if found).
Detailed Regex Explanation:
(leading character): a literal space
(?=([^\s\(]+ )*\S*\)): positive lookahead, what must follow the space
([^\s\(]+ )*: any combination characters not including the open parenthesis and the space characters + space (this group is optional)
\S*\): any non-space character + closed parenthesis
(?!\S*\s*\(): negative lookahead, what must not follow the space
\S*: any non space character (optional), followed by
\s*: any space character (optional), followed by
\(: the open parenthesis
Check the demo here.
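The lookahead pattern uses only constructs that Python's re module also supports, so it can be tried as follows. This is a sketch checked only against the sample string from the question; the pattern starts with a literal space, and the names are hypothetical.

```python
import re

# A literal space, then the positive and negative lookaheads explained above.
SPACE_IN_PARENS = re.compile(r' (?=([^\s\(]+ )*\S*\))(?!\S*\s*\()')


def remove_spaces_in_parens(text):
    # Only the space itself is consumed; the lookaheads consume nothing.
    return SPACE_IN_PARENS.sub('', text)
```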

Ungreedy with look behind

I have this kind of text:
other text opt1 opt2 opt3 I_want_only_this_text because_of_this
And am using this regex:
(?<=opt1|opt2|opt3).*?(?=because_of_this)
Which returns me:
opt2 opt3 I_want_only_this_text
However, I want to match only "I_want_only_this_text".
What is the best way to achieve this?
I don't know in what order the opt's will appear and they are only examples. Actual words will be different and there will be more of them.
Actual data:
regex
(?<=※|を|備考|町|品は|。).*(?=のお届けとなります|でお届けします|にてお届け致します|にてお届けいたします)
text
こだわり豚には通常の豚よりビタミンB1が2倍以上あります。私たちの育てた愛情たっぷりのこだわり豚をぜひ召し上がってください。商品説明名称えびの産こだわり豚切落し産地宮崎県えびの市内容量500g×8パック合計4kg賞味期限90日保存方法-15℃以下で保存すること提供者株式会社さつま屋産業備考・本お礼品は冷凍でのお届けとなります
what I want to get:
冷凍で
You can use
(?<=※|を|備考|町|品は|。)(?:(?!※|を|備考|町|品は|。).)*?(?=のお届けとなります|でお届けします|にてお届け致します|にてお届けいたします)
See the regex demo. The scheme is the same as in (?<=opt1|opt2|opt3)(?:(?!opt1|opt2|opt3).)*?(?=because_of_this) (see demo).
The tempered greedy token solution allows you to match multiple occurrences of the same pattern in a longer string.
Details
(?<=※|を|備考|町|品は|。) - a positive lookbehind that matches a location that is immediately preceded with one of the alternatives listed in the lookbehind
(?:(?!※|を|備考|町|品は|。).)*? - any char other than a line break char, zero or more but as few as possible occurrences, that is not a starting point of any of the alternative patterns in the negative lookahead
(?=のお届けとなります|でお届けします|にてお届け致します|にてお届けいたします) - a positive lookahead that requires one of the alternative patterns to appear immediately to the right of the current location.
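The opt1/opt2/opt3 form of the tempered greedy token can be tried in Python as sketched below. One assumption to be aware of: Python's re module only accepts lookbehinds whose alternatives all have the same width, so the Japanese pattern (with alternatives of different lengths) would not compile there, while the opt variant does.

```python
import re

TEMPERED = re.compile(
    r'(?<=opt1|opt2|opt3)'        # position right after one of the markers
    r'(?:(?!opt1|opt2|opt3).)*?'  # chars that do not start another marker
    r'(?=because_of_this)'        # stop right before the terminator
)

text = 'other text opt1 opt2 opt3 I_want_only_this_text because_of_this'
found = TEMPERED.search(text).group()
# The surrounding spaces belong to the match, so strip them if unwanted.
wanted = found.strip()
```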
You could add a negative lookahead (?!\s*opt\d) to assert that there is no opt and a digit to the right. You can use a character class to list the digits 1, 2 and 3 instead of using the alternation with |.
(?<=\bopt[123]\s(?!\s*opt\d)).*?(?=\s*\bbecause_of_this\b)
Regex demo
It might be a bit more efficient to use a match with a capture group:
\bopt[123]\s(?!\s*opt\d)(.*?)\s*\bbecause_of_this\b
Regex demo
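The capture-group form has a fixed-width-lookbehind-free shape, so it runs unchanged in Python's re module. A short sketch (the variable names are hypothetical):

```python
import re

CAPTURE = re.compile(r'\bopt[123]\s(?!\s*opt\d)(.*?)\s*\bbecause_of_this\b')

text = 'other text opt1 opt2 opt3 I_want_only_this_text because_of_this'
# The negative lookahead rejects opt1 and opt2 because another opt follows them.
captured = CAPTURE.search(text).group(1)
```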
What about:
.*\bopt[123]\b\s*(.*?)\s*because_of_this\b
See the online demo.
.* - A greedy match of any character other than newline up to the last occurrence of:
\bopt[123]\b - A word boundary followed by literally "opt" with a trailing number 1, 2 or 3 and another word boundary.
\s* - 0+ whitespace characters.
(.*?) - A 1st capture group with a lazy match of 0+ characters up to:
\s* - 0+ whitespace characters.
because_of_this\b - Literally "because_of_this" followed by a word-boundary.
If you need to have this written out in alternations:
.*\b(?:opt1|opt2|opt3)\b\s*(.*?)\s*because_of_this\b
See that demo.
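Since the question says the real marker words differ and are more numerous, the alternation can be built from a list. This is a sketch in Python under that assumption; text_after_last_marker and its parameters are hypothetical names, and re.escape is used so that markers are treated literally.

```python
import re


def text_after_last_marker(text, markers, terminator):
    """Return the text between the last marker and the terminator, or None."""
    alternation = '|'.join(re.escape(m) for m in markers)
    pattern = (
        r'.*\b(?:' + alternation + r')\b'    # greedy .* backtracks to the last marker
        r'\s*(.*?)\s*'                       # lazily captured payload
        + re.escape(terminator) + r'\b'
    )
    match = re.search(pattern, text)
    return match.group(1) if match else None
```

The \b anchors assume the markers start and end with word characters, as opt1 to opt3 do.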

Match any character but no empty and not only white spaces

I have this regex:
\[tag\](.*?)\[\/tag\]
It matches any character between two tags. The problem is that it also matches empty content or just white spaces inside the tags, for example:
[tag][/tag]
[tag] [/tag]
How can I avoid it? I want it to match at least 1 character, and not only white spaces. Thanks!
Use
\[tag\](?!\s*\[\/tag\])(.*?)\[\/tag\]
       ^^^^^^^^^^^^^^^^
See the regex demo.
The (?!\s*\[\/tag\]) is a negative lookahead that fails the match if, immediately to the right of the current location, there is 0+ whitespaces, [/tag].
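A small check of the lookahead in Python (a sketch; the sample string and variable names are made up for illustration):

```python
import re

NON_BLANK_TAG = re.compile(r'\[tag\](?!\s*\[/tag\])(.*?)\[/tag\]')

sample = '[tag][/tag] [tag] [/tag] [tag]x[/tag] [tag] a b [/tag]'
# Empty and whitespace-only tags are skipped; surrounding spaces are kept.
contents = NON_BLANK_TAG.findall(sample)
print(contents)  # ['x', ' a b ']
```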
You might change your expression to something similar to this:
\[tag\]([\s\S]+)\[\/tag\]
and you might add a quantifier to it, and bound it with number of chars, similar to this expression:
\[tag\]([\s\S]{3,})\[\/tag\]
Or you could do the same with your original expression as this expression:
Try this regex:
\[(tag)\](?!\s*\[\/\1\])(.*?)\[\/\1\]
This regex matches tag only if it has at least one non-whitespace char.
If this is PCRE (or PHP), Notepad++, or Perl, use this
(?s)(?:\[tag\]\s*\[/tag\](*SKIP)(?!)|\[tag\]\s*(.+?)\s*\[/tag\])
https://regex101.com/r/aCsOoQ/1
If not, you're stuck with using Stribnetz regex, which works because of an odd condition of your requirements.
Readable
(?s)
(?:
     \[tag\]
     \s*
     \[/tag\]
     (*SKIP)
     (?!)
  |
     \[tag\]
     \s*
     ( .+? )          # (1)
     \s*
     \[/tag\]
)

RegEx: don't capture match, but capture after match

There are a thousand regular expression questions on SO, so I apologize if this is already covered. I did look first.
I have string:
Name Subname 11X22 88X620 AB33(20) YA5619 77,66
I need to capture this string: YA5619
What I am doing is just finding AB33(20) and after this I am capturing until first white space. But AB33(20) can be AB-33(20) or AB33(-20) or AB33(-1).
My preg_match regex is: (?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)
Why am I getting an error when I change \d{2} to \d+?
For the final result I thought this regex would work, but it doesn't:
(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)
Any ideas what I am doing wrong?
With most regex flavors, lookbehind needs to evaluate to a fixed-length sequence, so you can't use variable quantifiers like * or + or even {1,2}.
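Python's re module is one of the flavors with this restriction, so it can serve to illustrate it (the question itself uses PHP's preg_match; Python is an assumption made here for a runnable sketch, and compiles is a hypothetical helper).

```python
import re


def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Every quantifier inside the lookbehind has a fixed width: accepted.
fixed_ok = compiles(r'(?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)')
# -? and \d+ make the lookbehind variable-width: rejected.
variable_ok = compiles(r'(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)')
```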
Instead of using lookaround, you can simply match your marker pattern and then forget it with \K.
AB-?\d+(?:\(-?\d+\))? \K[^ ]+
demo: https://regex101.com/r/8XXngH/1
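Where \K is not available (Python's built-in re module, for instance), a capture group can play the same role. The sketch below swaps \K for a group and is otherwise the pattern above; token_after_marker is a hypothetical name.

```python
import re

# Same marker pattern, with a capture group in place of \K.
AFTER_MARKER = re.compile(r'AB-?\d+(?:\(-?\d+\))? ([^ ]+)')


def token_after_marker(line):
    """Return the token following the AB... marker, or None if it is absent."""
    match = AFTER_MARKER.search(line)
    return match.group(1) if match else None
```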
It depends on the language. In .NET, for example, the pattern matches because lookbehind may have variable length there.
Another solution might be to use a character class listing the characters you would allow to match. Then match a whitespace character, reset the match with \K, and match \S+, which matches 1+ non-whitespace characters.
\bAB[()\d-]+\s\K\S+
Explanation
\bAB Match literally prepended with word boundary to prevent AB being part of a larger match.
[()\d-]+ Match 1+ times any of the listed character in the character class
\s Match a whitespace char (or \s+ to match 1 or more)
\K Reset the starting point of the reported match (forget what was matched so far)
\S+ Match 1+ times a non-whitespace character
Regex demo | Php demo

reg expression to truncate a string from last dot

I have following string and I want to strip the last part starting from dot. Could you please advise? I am new to reg expressions.
[abc].[def].[ghi]
Thanks,
mc
The regexp you need is:
(.*?)(?:\.[^.]*)?$
The regexp piece by piece:
( # start of the first capturing sub-pattern
.* # matches any character, any number of times (zero or more)
? # make the previous quantifier (`*`) not greedy
) # end of the first sub-pattern
(?: # start of the second sub-pattern; it doesn't capture the matching string
\. # matches a dot (.)
[^.]* # matches anything but a dot (.), any number of times (zero or more)
) # end of the second sub-pattern
? # the previous sub-expression (the non-capturing sub-pattern) is optional
$ # matches the end of the string
How it works:
The first part (.*?) matches and captures everything until the last dot. The question mark (?) makes the zero or more quantifier (*) not greedy. The quantifier is greedy by default and, because the second sub-expression has to be optional (read below), a greedy .* would match the entire string.
The ?: specifier at the start of the second sub-pattern makes it non-capturing. The sub-string it matches is not stored and it's not available for further use.
The second sub-pattern contains \.[^.]* and matches a dot (.) followed by zero or more characters, none of which can be dots. If the input string doesn't contain a dot, this sub-pattern matches nothing, which would make the entire regexp fail. This is why it is marked as optional by following it with a question mark (?).
Most tools that work with regexp provide a way to get and use the captured strings using $n or \n as placeholders in the replacement string. n above is the number of the capturing pattern, counting by its open parenthesis (. Since we have only one capturing sub-pattern, the substring it matches should be available either as $1 or \1 (or both, or using a different syntax).
You can play with this regexp on regex101.com.
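For a concrete run, here is a minimal sketch in Python (the question names no language, so this is an assumption; strip_last_part is a hypothetical helper). It reads the first capturing sub-pattern directly instead of using a replacement string.

```python
import re

BEFORE_LAST_DOT = re.compile(r'(.*?)(?:\.[^.]*)?$')


def strip_last_part(s):
    # Assumes single-line input: '.' does not match a newline by default.
    return BEFORE_LAST_DOT.match(s).group(1)
```

A string without any dot comes back unchanged, because the second sub-pattern is optional.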