How to negate a word in pure ERE regex

Is it possible to negate a word in pure ERE regex (POSIX.2 Extended Regular Expressions)?
It seems possible to negate a character or class using ^, as in [^ab], which excludes either character a or b. But how do I negate a whole word such as ab (and not just either character)?
If I have a variable with the following value:
week="Monday|Tuesday|Wednesday|Thursday"
Is it possible to extract all content of week without Tuesday to get a result as:
week="Monday|Wednesday|Thursday"
Regards

You cannot do this with a single match in any regex engine.
A regex can only match one contiguous piece of text in one go. So you cannot get Monday|Wednesday|Thursday as the full match after running the string through a regex.
If you want to skip a word you usually remove it from the string. Check if your environment offers a replacing feature.
Also, sometimes, it is possible to split a string with the string to skip. Check if there is split support in your environment.
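A minimal sketch of the split-and-rejoin idea, written in Python purely as an assumption (the question does not name an environment; any language with a split or replace facility would work the same way):

```python
# Remove one word from a "|"-separated list without a negating regex.
week = "Monday|Tuesday|Wednesday|Thursday"

# Split on the separator, drop the unwanted word, and join the rest again.
kept = [day for day in week.split("|") if day != "Tuesday"]
result = "|".join(kept)

print(result)
```

Splitting on the separator avoids having to decide which neighbouring | to delete together with the word.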

Related

How do I match any character but "&&"?

I have the following string:
a simple string that may contain this & that but I want it to skip &&. Do you follow?
I am working with the following regex to try to split it on a double occurrence of "&&":
[^&]*\s
This picks up the single "&" occurrence.
I've also tried:
[^&]{2}*\s
But that doesn't pick up anything.
The material I have found online applies to PCRE, and I am unable to find a solution for the RE2 syntax that Go uses.
https://regex101.com/r/kgVVvB/1
Normally, you would be able to use (?:(?!&&|(?<=&)&).)+, but since Go's regexp package currently supports neither lookaheads nor lookbehinds, you have to work around this within the regex itself. Obviously, using string functions may work for your case, but as you mention in the comments below your question, this is part of a larger regex, so here it is:
See regex in use here
(?:(?:^|[^&])&(?:[^&]|$)|[^&])+
(?:(?:^|[^&])&(?:[^&]|$)|[^&])+ Match either of the following one or more times
(?:^|[^&])&(?:[^&]|$)
(?:^|[^&]) Assert position at the start of the line or match any character that is not &
& Match this literally
(?:[^&]|$) Match any character that is not & or assert position at the end of the line
[^&] Match any character that is not &
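As a quick check, the pattern can be run against the sample string from the question. This sketch uses Python's re module only for illustration (an assumption, since the question targets Go); the pattern contains no lookarounds, so nothing in it depends on Python-specific syntax:

```python
import re

# The lookaround-free pattern from the answer above.
pattern = re.compile(r"(?:(?:^|[^&])&(?:[^&]|$)|[^&])+")

text = ("a simple string that may contain this & that "
        "but I want it to skip &&. Do you follow?")

# Each match is a run of text that does not contain "&&".
parts = pattern.findall(text)
for part in parts:
    print(repr(part))
```

The single & stays inside the first part, and the text is cut only where && occurs.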

Possible to use a back reference in a number range?

I want to match a string where a number is equal to or higher than a number in a capturing group.
Example:
1x1 = match
1x2 = match
2x1 = no match
In my mind the regex would look something like this (\d)x[\1-9] but this doesn't work. Is it possible to achieve this using regex?
As you've discovered, you cannot interpolate a captured value into a character class:
Because character classes are determined when the regex is compiled... The only character class regex node type is "hard-coded list of characters" that was built when the regex was compiled (not after it ran part way and figured out what $1 might end up being).
[Source]
Since character classes do not permit backreferences, a backslash followed by a number is repurposed in a character class:
A backslash followed by two or three octal digits is considered an octal number.
[Source]
This obviously isn't what you intended by [\1-9]. But since there's no way to compile a character class until all characters are known, we'll have to find another way.
If we're looking to do this entirely within a regex, enumerating all possible combinations does not really help, because we would still have to check all the captures to figure out which one matched. For example:
"1x2" =~ m/(?:(0)x(\d)|(1)x([1-9])|(2)x([2-9])|(3)x([3-9])|(4)x([4-9])|(5)x([5-9])|(6)x([6-9])|(7)x([7-9])|(8)x([89])|(9)x(9))/
This will contain "1" in $3 and "2" in $4, but you'd have to search captures 1 to 20 each time to find out whether anything was matched.
The only way to avoid post-processing the regex results is to use a regex conditional: (?(A)X), where A is the condition and X is the resulting action.
Sadly conditionals are not supported by RE2, but we'll keep going just to demonstrate it can be done.
What you'd want to use for the X is (*F) (or (?!) in Ruby 2+) to force failure: http://www.rexegg.com/regex-tricks.html#fail
What you'd want to use for the A is ?{$1 > $2}, but only Perl will allow you to use code directly in a regex. Perl would allow you to use:
m/(\d)x(\d)(?(?{$1 > $2})(?!))/
[Live Example]
So the answer to your question is: "No, you cannot do this with RE2 which Google Analytics uses, but yes you can do this with a Perl regex."
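Where running code around the regex is an option, the post-processing mentioned above is short. This is only a sketch in Python (an assumption; it does not apply inside a tool that accepts nothing but an RE2 pattern), and the helper name second_not_smaller is hypothetical:

```python
import re

# Capture both digits with the regex, then compare them in ordinary code.
PAIR = re.compile(r"(\d)x(\d)")


def second_not_smaller(s: str) -> bool:
    """Return True if s has the form 'AxB' with digit B >= digit A."""
    m = PAIR.fullmatch(s)
    return m is not None and int(m.group(2)) >= int(m.group(1))


for sample in ("1x1", "1x2", "2x1"):
    print(sample, second_not_smaller(sample))
```

Using fullmatch assumes the whole string is the pair; search would be the choice if the pair can appear inside longer text.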

regular expressions: first match vs greedy match

Consider the regular expression \d*
If I try to match this against the string JJJ123, Vertica's regex functions say it matches against the string of width zero at the beginning.
If I try it instead in MATLAB, it reports a match starting at the character 1.
The Vertica docs say that its regex engine is PCRE. I can't find much on MATLAB's, though I found hints that it's similar to Perl's.
Which of the behaviors is more standard for a Perl-like regex engine?
MATLAB's regexp has an emptymatch option that controls whether it will allow an entire regular expression to match an empty string. It is off ("noemptymatch") by default. See help regexp.
Vertica's matching the 0-length empty string at the beginning is normal behavior for most regex dialects that I know, including anything Perl-like.
To get the same behavior as Vertica, where it can match 0-length strings, pass the 'emptymatch' option in your regexp call. Also pass 'once' to prevent it from matching the empty strings between each and every character in your string.
[a,b,c,d] = regexp('JJJ123', '\d*', 'emptymatch', 'once')
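For comparison, here is a small Python sketch. Python's re module is Perl-like but is neither Vertica's nor MATLAB's engine, so this only illustrates the leftmost-match behavior described above:

```python
import re

text = "JJJ123"

# \d* may match zero digits, so the leftmost match is the empty string at index 0.
empty = re.search(r"\d*", text)
print(empty.span(), repr(empty.group()))

# Requiring at least one digit moves the match to the first run of digits.
digits = re.search(r"\d+", text)
print(digits.span(), repr(digits.group()))
```

If the digits are what you are after, \d+ sidesteps the empty-match question in either tool.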

Regular expression using negative lookbehind not working in Notepad++

I have a source file with literally hundreds of occurrences of strings flecha.jpg and flecha1.jpg, but I need to find occurrences of any other .jpg image (i.e. casa.jpg, moto.jpg, whatever)
I have tried using a regular expression with negative lookbehind, like this:
(?<!flecha|flecha1).jpg
but it doesn't work! Notepad++ simply says that it is an invalid regular expression.
I have tried the regex elsewhere and it works (here is an example), so I guess it is a problem with NPP's handling of regexes or with the syntax of lookbehinds/lookaheads.
So how could I achieve the same regex result in NPP?
If useful, I am using Notepad++ version 6.3 Unicode
As an extra, if you are so kind, what would be the syntax to achieve the same thing but with optional numbers (in this case only '1') as a suffix of my string? (even if it doesn't work in NPP, just to know)...
I tried (?<!flecha[1]?).jpg but it doesn't work. It should work the same as the other regex, see here (RegExr)
Notepad++ seems to not have implemented variable-length look-behinds (this happens with some tools). A workaround is to use more than one fixed-length look-behind:
(?<!flecha)(?<!flecha1)\.jpg
As you can check, the matches are the same, but this version works in Notepad++.
Notice that I escaped the dot: since you are trying to match file extensions, what you want is a literal dot (\.). The way you had it, the unescaped . was a wildcard that could match any character.
About the extra question, unfortunately, as we can't have variable-length look-behinds, it is not possible to have optional suffixes (numbers) without having multiple look-behinds.
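Python's re module has the same fixed-width restriction, so it can stand in for Notepad++ in a quick experiment. This is only a sketch under that assumption; the sample file names are made up for illustration:

```python
import re

# Alternatives of different lengths inside one lookbehind are rejected by Python's re.
try:
    re.compile(r"(?<!flecha|flecha1)\.jpg")
    variable_ok = True
except re.error:
    variable_ok = False

# Two fixed-length lookbehinds compile and express the same condition.
pattern = re.compile(r"(?<!flecha)(?<!flecha1)\.jpg")

text = "casa.jpg flecha.jpg flecha1.jpg moto.jpg"

# Only the extension is matched, so recover the name that precedes each match.
names = [text[:m.start()].split()[-1] for m in pattern.finditer(text)]
print(variable_ok, names)
```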
Solving the problem of the variable-length-negative-lookbehind limitation in Notepad++
Given here are several strategies for working around this limitation in Notepad++ (or any regex engine with the same limitation)
Defining the problem
Notepad++ does not support the use of variable-length negative lookbehind assertions, and it would be nice to have some workarounds. Let's consider the example in the original question, but assume we want to avoid occurrences of files named flecha with any number of digits after flecha, and with any characters before flecha. In that case, a regex utilizing a variable-length negative lookbehind would look like (?<!flecha[0-9]*)\.jpg.
Strings we don't want to match in this example
flecha.jpg
flecha1.jpg
flecha00501275696.jpg
aflecha.jpg
img_flecha9.jpg
abcflecha556677.jpg
The Strategies
Inserting Temporary Markers
Begin by performing a find-and-replace on the instances that you want to avoid working with - in our case, instances of flecha[0-9]*\.jpg. Insert a special marker to form a pattern that doesn't appear anywhere else. For this example, we will insert an extra . before .jpg, assuming that ..jpg doesn't appear elsewhere. So we do:
Find: (flecha[0-9]*)(\.jpg)
Replace with: $1.$2
Now you can search your document for all the other .jpg filenames with a simple regex like \w+\.jpg or (?<!\.)\.jpg and do what you want with them. When you're done, do a final find-and-replace operation where you replace all instances of ..jpg with .jpg, to remove the temporary marker.
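The three steps of the marker strategy can be sketched outside Notepad++ as well. The following uses Python's re module as an assumption (Python writes group references as \1 rather than $1), with made-up file names:

```python
import re

text = "casa.jpg flecha.jpg flecha12.jpg moto.jpg"

# Step 1: mark the names to avoid by inserting an extra dot before ".jpg".
marked = re.sub(r"(flecha[0-9]*)(\.jpg)", r"\1.\2", text)

# Step 2: a simple pattern now skips the marked names.
others = re.findall(r"\w+\.jpg", marked)

# Step 3: remove the temporary marker again.
restored = marked.replace("..jpg", ".jpg")

print(marked)
print(others)
```

As in the description above, this relies on ..jpg not occurring anywhere in the original text.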
Using a negative lookahead assertion
A negative lookahead assertion can be used to make sure that you're not matching the undesired file names:
(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg
Breaking it down:
(?<!\S) ensures that your match begins at the start of a file name, and not in the middle, by asserting that your match is not preceded by a non-whitespace character.
(?!\S*flecha\d*\.jpg) ensures that whatever is matched does not contain the pattern we want to avoid
\S+\.jpg is what actually gets matched -- a string of non-whitespace characters followed by .jpg.
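A short check of this pattern, using Python's re module as a stand-in for Notepad++ (an assumption) and a few of the file names listed earlier:

```python
import re

# Lookahead-based pattern from the breakdown above.
pattern = re.compile(r"(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg")

text = "casa.jpg flecha.jpg moto.jpg img_flecha9.jpg abcflecha556677.jpg"

# Whole file names are matched, and any name containing flecha + digits is skipped.
matches = pattern.findall(text)
print(matches)
```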
Using multiple fixed-length negative lookbehinds
This is a quick (but not-so-elegant) solution for situations where the pattern you don't want to match has a small number of possible lengths.
For example, if we know that flecha is only followed by up to three digits, our regex could be:
(?<!flecha)(?<!flecha[0-9])(?<!flecha[0-9][0-9])(?<!flecha[0-9][0-9][0-9])\.jpg
Are you aware that you're only matching (in the sense of consuming) the extension (.jpg)? I would think you wanted to match the whole filename, no? And that's much easier to do with a lookahead:
\b(?!flecha1?\b)\w+\.jpg
The first \b anchors the match to the beginning of the name (assuming it's really a filename we're looking at). Then (?!flecha1?\b) asserts that the name is not flecha or flecha1. Once that's done, the \w+ goes ahead and consumes the name. Then \.jpg grabs the extension to finish off the match.

regex to match strings not ending with a pattern?

I am trying to form a regular expression that will match strings that do NOT end with a DOT FOLLOWED BY A NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?
You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$ - which provides an anchor after the number. Therefore, the rejected pattern must be a dot followed only by a number at the end (not that you specified that in your question).
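To see how the anchored version behaves on the strings from the question, here is a small sketch in Python (an assumption; any engine with lookahead support should behave the same way for these inputs):

```python
import re

# Anchored variant from the EDIT above.
pattern = re.compile(r"^((?!\.[\d]+$)[\w.])+$")

samples = [
    "abcd1", "abcdf12", "abcdf124",
    "abcd1.0", "abcd1.134", "abcdf12.13", "abcdf124.2", "abcdf124.21",
]

# Keep only the strings that do not end with a dot followed by digits.
accepted = [s for s in samples if pattern.match(s)]
print(accepted)
```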
If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*\.?[^0-9]*$
You may omit the anchoring metasymbols ^ and $ if you're using a function or tool that matches against the whole string.
Explanation: This regex allows any symbols except a dot until an (optional) dot is found, after which all non-numerical symbols are allowed. It won't work for numbers in improper format, like in the strings abcd1...3 or abcd1.fdfd2. It also won't work correctly for some strings with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambiguous).
Philosophical explanation: When using regular expressions, you often don't need to do exactly what your task seems to require, so even a simple regex will do the job. An abstract example: you have a file whose lines are either numbers or some complicated names (without digits), and you want to filter out all the numbers. Then simple filtering by [^0-9], as in grep '^[^0-9]', will do the job.
But if your task is more complex and requires validation of format and doing other fancy stuff on data, why not use a simple script(say, in awk, python, perl or other language)? Or a short "hand-written" function, if you're implementing stand-alone application. Regexes are cool, but they are often not the right tool to use.
I would just use a simple negative look-behind anchored at the end:
.*(?<!\\.\\d+)$