Negative Lookbehind Workaround For Posix Regex

I need to exclude a string from being matched if it's preceded by a certain character, and my regex engine is POSIX. I got the desired result using a negative lookbehind on https://regexr.com/ but just discovered that it won't work on my POSIX Snowflake platform :-( .
I'm trying to standardize variations of company names and want to match the strings that end in 'COMPANY', 'CO', or 'CO.', but not match them if preceded by ' & '. So 'COMPANY' would get matched in 'POWERWASH COMPANY', but not in 'JONES & COMPANY'.
Is there a way I can accomplish this in POSIX regex? I was able to get this to work using a negative lookbehind as follows:
(?<!&)( COMPANY$| CO[.]?$)

You may use a capturing group (as you're already doing) and put the irrelevant parts outside of the group:
[^&]( COMPANY| CO\.?)$
Demo.
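I'm not that familiar with Snowflake, but according to the documentation you can extract the value captured by group 1 using the regexp_substr function as follows:
regexp_substr(input, '[^&]( COMPANY| CO[.]?)$', 1, 1, 'e', 1)
-- The last argument (1) is the group number.
-- [.] is used instead of \. because a backslash inside a single-quoted Snowflake string would itself need escaping.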
Note that [^&] will match any character other than '&'. If you'd like the match to succeed even if the target word is at the beginning of the string, you may use (^|[^&]) in place of [^&]. In that case, you may extract the value from group 2 rather than group 1.

You can use
(^|[^&])( COMPANY| CO[.]?)$
See the regex demo.
In POSIX regex it usually doesn't matter what else the pattern consumes around the part you want. When it does matter, it is usually easy to work around using additional capturing groups and code logic.
Regex details:
(^|[^&]) - start of string or any char other than &
( COMPANY| CO[.]?) - either a space and COMPANY, or a space, CO and an optional .
$ - end of string
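The capture-group workaround can be tried out with a minimal Python sketch. It assumes Python's `re` module as a stand-in for the POSIX engine (the pattern itself uses no lookarounds), and the helper name `company_suffix` is made up for illustration.

```python
import re

# Group 1 consumes the character before the suffix (or the start of the
# string); group 2 holds the suffix we actually care about.
SUFFIX = re.compile(r"(^|[^&])( COMPANY| CO[.]?)$")


def company_suffix(name):
    """Return the trailing company suffix, or None if it follows '&'."""
    match = SUFFIX.search(name)
    return match.group(2) if match else None


for name in ["POWERWASH COMPANY", "JONES & COMPANY", "ACME CO.", "SMITH & CO"]:
    print(repr(name), "->", repr(company_suffix(name)))
```

Only group 2 is read, so the extra character consumed by group 1 never reaches the result.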

Related

Regex optionally extracting characters between two characters

I have the following string: thisIs/My-7777-Any-other-text. The following is also possible: thisIs/My-7777
I am looking to extract My-7777 in both scenarios using regex. So essentially I am looking to extract everything between the first forward slash and the second hyphen (the second hyphen may not exist). I tried the following regex, which wasn't quite right:
(?<=\/)(.*)(?=-)

You could use a capture group
^[^\/]*\/([^-]*-[^-]*)
^ Start of string
[^\/]*\/ Match optional chars other than /, then a /
( Capture group
[^-]*-[^-]* Match a - between optional chars that are not -
) Close capture group
regex demo
Without an anchor, and not allowing / before and after -
[^\/]*\/([^-\/]*-[^-\/]*)
Regex demo
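As a quick check of the anchored capture-group pattern, here is a small Python sketch. The function name `extract_segment` is hypothetical, and the forward slash is left unescaped because Python patterns have no delimiter.

```python
import re

# Skip everything up to the first "/", then capture "chars-chars" where
# neither side contains a hyphen.
PATTERN = re.compile(r"^[^/]*/([^-]*-[^-]*)")


def extract_segment(text):
    """Return the part between the first '/' and the second '-', if any."""
    match = PATTERN.match(text)
    return match.group(1) if match else None


for sample in ["thisIs/My-7777-Any-other-text", "thisIs/My-7777"]:
    print(sample, "->", extract_segment(sample))
```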

If we take into account the structure of your current input strings, you can use
(?<=\/)[^-]+-[^-]+
See the regex demo.
If your strings are more complex and look like thisIs/My-7777/more-text-here, and you actually want to match from the first /, then you may use
^[^\/]+\/\K[^\/-]+-[^\/-]+ ## PHP, PCRE, Boost (Notepad++), Onigmo (Ruby)
(?<=^[^\/]+\/)[^\/-]+-[^\/-]+ ## JS (except IE & Safari), .NET, Python (PyPI regex)
See this regex demo or this regex demo. Note that \n is added to the negated character classes in the demo because the input there is a single multiline string. With real input, if a newline character can occur, add it to each negated character class so that the match stays on one line.

This one works for me. Try it with the case-insensitive option ticked:
Find what: .*?/|-any.*
Replace with: (leave blank)
The output should be: My-7777

how to make a regex to validate a username

I've written this regex
/(?=.*[a-z])(?!.*[A-Z])([\w\_\-\.].{3,10})/g
to check the following conditions
>has a minimum of 3 and a maximum of 10 characters.
>must contain at least one lowercase letter.
>must contain only lowercase letters, '_', '-', '.' and digits.
This works but returns true even if there are more than 10 characters.
I would like a new or modified regular expression to check the conditions given above.

Add anchors.
Remove the last dot.
The negative lookahead is useless if you use a correct character class.
This regex will work:
^(?=.*[a-z])[a-z0-9_.-]{3,10}$
Demo & explanation
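A short Python sketch of the anchored pattern, with a few sample usernames. The helper name `is_valid_username` is an assumption, and `re.fullmatch` is used so that a trailing newline is not silently accepted.

```python
import re

# The lookahead demands at least one lowercase letter; the character class
# and {3,10} quantifier, anchored at both ends, enforce alphabet and length.
USERNAME = re.compile(r"^(?=.*[a-z])[a-z0-9_.-]{3,10}$")


def is_valid_username(name):
    """True if name is 3-10 allowed characters with a lowercase letter."""
    return USERNAME.fullmatch(name) is not None


for candidate in ["usr", "username77", "username123", "1234", "User_1"]:
    print(candidate, is_valid_username(candidate))
```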

You can use this regex (REGEX Demo):
([a-z]{1}[0-9a-z_.-]{2,9})
Tested with:
username77
usr
username10
user_test
usr.1000

There are many ways of doing this. I believe the common characteristic is they will all have a positive lookahead. Here is another.
^(?=.{3,10}$)[a-z\d_.-]*[a-z][a-z\d_.-]*$
Demo
Notice that [a-z\d_.-]* appears twice. Some regex engines support subroutines (or subexpressions) that allow one to save a repeated part of the regex to a numbered or named capture group for reuse later in the string. When using the PCRE engine, for example, you could write
^(?=.{3,10}$)([a-z\d_.-]*)[a-z](?1)$
Demo
(?1) re-runs the pattern of capture group 1, ([a-z\d_.-]*), at that position, in contrast to \1, which matches the text that capture group 1 captured. Subroutines can shorten the regex, but more importantly they reduce the chance of errors when the repeated tokens are changed.
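Python's built-in `re` module has no subroutine syntax, so a rough equivalent there is to build the pattern from a shared string. The sketch below is one way to do that, not the PCRE feature itself. The names `ALLOWED` and `is_valid_username` are invented for illustration, and `re.ASCII` is an added assumption that keeps `\d` limited to 0-9.

```python
import re

# The repeated fragment is written once and reused when building the pattern.
ALLOWED = r"[a-z\d_.-]*"
USERNAME = re.compile(
    r"^(?=.{3,10}$)" + ALLOWED + r"[a-z]" + ALLOWED + r"$",
    re.ASCII,
)


def is_valid_username(name):
    """True if name has 3-10 allowed characters, one of them a-z."""
    return USERNAME.fullmatch(name) is not None


for candidate in ["usr", "12a", "123", "username123"]:
    print(candidate, is_valid_username(candidate))
```

A change to the allowed character set then only has to be made in one place, which is the same maintenance benefit the subroutine gives.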

regex - match only quotes surrounding numeric values

So I use a lot of regex to format SQL.
I'm trying to match all quotes surrounding numeric values (INT) so I can remove them.
I use this to match numerics in quotes:
(?<=(['"])\b)(?:(?!\1|\\)[0-9]|\\.)*(?=\1)
Playing with this so far but no luck yet:
'(?=[0-9](?='))*
What I'm trying to say is: look ahead as far as needed, matching only digits, and accept the match only if the next thing is the closing quote.
Can any regex ninjas out there help put me on the right path?
Here's an example string:
'2018-12-09 07:29:00.0000000', 'US', 'MI', 'Detroit', '48206', '505', '68.61.112.245', '0', 'Verizon'
I just want to match the ' around 48206, 505, and 0 so I can strip them.
To be safe, let's assume other characters could appear in the test string as well, i.e. it's not really feasible to say "just match anything that's not a dash, a letter, or a dot", etc. Also, the question is language-agnostic, so any applicable language is fine: JavaScript, Python, Java, etc.

You can select all such numbers using this regex,
'(\d+)'
And then replace it with \1 or $1, depending on your language.
Demo
This will get rid of all quotes that are surrounding numbers.
Let me know if this works for you.
Also, as an alternative solution, if your regex engine supports ECMAScript 2018, you can exploit variable-length lookbehind and use this regex to select only the quotes that surround a number,
'(?=\d+')|(?<='\d+)'
And replace it with empty string.
Demo
Make sure you check this demo in a browser that supports lookbehind, such as Chrome. Mozilla didn't support it at the time of writing.

.split().join() Chain
.split() can use RegEx such as this:
/'\b([0-9]+?)\b'/
Literal match of a single straight quote: '
Word boundary marking the start of a word/number: \b
Capture group containing a character class of any digit: ([0-9]+?)
The lazy quantifier +? matches at least one digit and continues until the next word boundary and a literal single straight quote are reached: \b'
Since .split() works through the whole string, a global flag isn't needed. Because the regex contains a capture group, .split() keeps the captured digits in the resulting array, so chaining .join('') turns the array back into a string with the quotes around the numbers removed.
Demo
var strA = `'2018-12-09 07:29:00.0000000', 'US', 'MI', 'Detroit', '48206', '505', '68.61.112.245', '0', 'Verizon'`;
var strB = strA.split(/'\b([0-9]+?)\b'/).join('');
console.log(strB);

You could capture a single or a double quote in a capturing group, as in your first regex, then capture the digits in between in group 2, and finally use a backreference to group 1.
In the replacement, use the second capturing group, $2 for example:
(['"])(\d+)\1
Explanation
(['"]) Capture ' or " in a capturing group
(\d+) Capture 1+ digits in a group
\1 Backreference to group 1
Regex demo
Result
''2018-12-09 07:29:00.0000000', 'US', 'MI', 'Detroit', 48206, 505, '68.61.112.245', 0, 'Verizon''
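The backreference approach can be sketched in Python, where the replacement is written `\2` instead of `$2`. The function name `unquote_integers` is hypothetical, and the sample row is the one from the question.

```python
import re

# Group 1: the opening quote; group 2: the digits; \1: the same quote again.
QUOTED_INT = re.compile(r"""(['"])(\d+)\1""")


def unquote_integers(sql_values):
    """Strip matching quotes that surround purely numeric values."""
    return QUOTED_INT.sub(r"\2", sql_values)


row = ("'2018-12-09 07:29:00.0000000', 'US', 'MI', 'Detroit', "
       "'48206', '505', '68.61.112.245', '0', 'Verizon'")
print(unquote_integers(row))
```

Values such as the timestamp and the IP address keep their quotes because the closing quote does not follow the first run of digits.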

Regex: how do I match a character before other capture characters?

I'm trying to match on a list of strings where I want to make sure the character before the match is not an equals sign, without capturing that character. So, for a list (excerpted from pip freeze) like:
ply==3.10
powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
psutil==4.0.0
ptyprocess==0.5.1
I want the captured output to look like this:
==3.10
==4.0.0
==0.5.1
I first thought using a negative lookahead (?![^=]) would work, but with a regular expression of (?![^=])==[0-9]+.* it ends up capturing the line I don't want:
==3.10
==2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
==4.0.0
==0.5.1
I also tried using a non-capturing group (?:[^=]) with a regex of (?:[^=])==[0-9]+.* but that ends up capturing the first character which I also don't want:
y==3.10
l==4.0.0
s==0.5.1
So the question is this: How can one match but not capture a string before the rest of the regex?

A negative lookbehind would be the way to go:
(?<!=)==[0-9.]+
Also, here is the site I like to use:
http://www.rubular.com/
Of course, it sometimes helps if you say which engine/software you are using, so we know what limitations there might be.
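For reference, here is a minimal Python sketch of the negative lookbehind applied to the sample lines from the question. Python's `re` module is an assumption here, since the question doesn't name an engine.

```python
import re

# "==" followed by digits and dots, but only when the character just before
# it is not another "=" (which rules out "===").
PINNED = re.compile(r"(?<!=)==[0-9.]+")

freeze_output = """\
ply==3.10
powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
psutil==4.0.0
ptyprocess==0.5.1
"""

versions = PINNED.findall(freeze_output)
print(versions)
```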

If you want to remove the version numbers from the text, you could capture a character that is not an equals sign, ([^=]), in the first capturing group, followed by matching == and the version number, \d+(?:\.\d+)+. Then in the replacement you would use your capturing group.
Regex
([^=])==\d+(?:\.\d+)+
Replacement
Group 1 $1
Note
You could also use ==[0-9]+.* or ==[0-9.]+ to match the double equals signs and version numbers but that would be a very broad match. The first would also match ====1test and the latter would also match ==..

There's another regex operator called a 'lookbehind assertion' (also called a positive lookbehind), (?<=...), and using it in my example above in the expression (?<=[^=])==[0-9]+.* results in the expected output:
==3.10
==4.0.0
==0.5.1
It took me a while to discover this. Notably, at the time of this writing the lookbehind assertion isn't supported in the popular regex tool regexr.
If there are alternatives to using a lookbehind to solve this, I'd love to hear them.

Regex matching sortcode

I'm trying to match a UK bank sort code (nothing complex, just three pairs of digits, optionally separated by hyphens). I thought that I could just try and match the first hyphen and then use backreferencing to check if the value I'm matching was using hyphens at all.
/^\d{2}(-)?\d{2}\1\d{2}$/
Which should match against
12-34-56
123456
But not against
12-3456
1234-56
Which is fine and works fine — in JavaScript.
When I use a PCRE engine (e.g. PHP) the regex doesn't match as I'd expect. I've used a different regex to avoid this but I'd still like to know what's going on?
Can I use some internal dark magic along the lines of (*SKIP)(*FAIL) to be able to use the optional backreference?

JavaScript follows the ECMA standard specifications.
According to the official ECMA standard, a backreference to a non-participating capturing group must successfully match nothing just like a backreference to a participating group that captured nothing does. (source)
Here, in /^\d{2}(-)?\d{2}\1\d{2}$/, we have an optional capturing group that can match a hyphen. JavaScript regex engine "consumes" empty texts in optional groups for them to be later accessible via backreferences. This problem is closely connected with the Backreferences to Failed Groups. E.g. ((q)?b\2) will match b in JavaScript, but it won't match in PCRE.
So, a way out is using an obligatory capture group with an empty alternative (demo):
^\d{2}(-|)\d{2}\1\d{2}$
       ^^
Also, you can move the ? quantifier to the hyphen itself (demo):
^\d{2}(-?)\d{2}\1\d{2}$
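The difference can also be reproduced outside PHP. The sketch below assumes Python's `re` module, which, as far as I know, also fails a backreference to a group that did not participate in the match. The constant names are made up for illustration.

```python
import re

# Optional group: when there is no hyphen, group 1 never participates,
# so \1 fails in Python (the same outcome the question saw in PCRE).
OPTIONAL_GROUP = re.compile(r"^\d{2}(-)?\d{2}\1\d{2}$")

# Optional hyphen inside the group: group 1 always participates and
# captures either "-" or the empty string.
OPTIONAL_HYPHEN = re.compile(r"^\d{2}(-?)\d{2}\1\d{2}$")

for code in ["12-34-56", "123456", "12-3456", "1234-56"]:
    print(code,
          bool(OPTIONAL_GROUP.match(code)),
          bool(OPTIONAL_HYPHEN.match(code)))
```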

You can change your regex to match it:
^\d{2}(-|)\d{2}\1\d{2}$
RegEx Demo
Changing (-)? to (-|) makes sure that we capture either hyphen or an empty string in group #1.
Code:
New RegEx:
preg_match('/^\d{2}(-|)\d{2}\1\d{2}$/', '123456', $m);
print_r($m);
Array
(
[0] => 123456
[1] =>
)
Older Regex:
preg_match('/^\d{2}(-)?\d{2}\1\d{2}$/', '123456', $m);
print_r($m);
Array
(
)
