RegEx that excludes characters doesn't begin matching until 2nd character - regex

I'm trying to create a regular expression that accepts any ASCII character but excludes certain characters such as "+" or "%". I'm currently using this:
^[\x00-\x7F][^%=+]+$
But I noticed (using various RegEx validators) that this pattern only begins matching with 2 characters. It won't match "a" but it will match "ab." If I remove the "[^]" section, (^[\x00-\x7F]+$) then the pattern matches one character. I've searched for other options, but so far come up with nothing. I'd like the pattern to begin matching on 1 character but also exclude characters. Any suggestions would be great!

Try this:
^(?:(?![%=+])[\x00-\x7F])+$
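At each position, the negative lookahead checks that the next character is not one of the "bad" characters, then [\x00-\x7F] matches that character, and the + repeats this for every character in the string. Your original pattern needs at least two characters because [\x00-\x7F] consumes exactly one character and [^%=+]+ then requires at least one more.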
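As a rough check of the behaviour described above, the following Python sketch compares the pattern from the question with the lookahead version. The sample inputs are made up for illustration, and Python's re module stands in for whichever engine you are actually using.

```python
import re

# Pattern from the question: one ASCII character, then one or more
# characters that are not %, = or +, so at least two characters overall.
original = re.compile(r'^[\x00-\x7F][^%=+]+$')

# Pattern from the answer: every character must pass the lookahead
# and be ASCII, so a single character is enough.
fixed = re.compile(r'^(?:(?![%=+])[\x00-\x7F])+$')

samples = ["a", "ab", "a+b", "+a"]  # hypothetical test inputs
for s in samples:
    print(repr(s), bool(original.match(s)), bool(fixed.match(s)))
```

Note that the original pattern also accepts "+a", because its first character class does not exclude anything.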

You can use a negative lookahead here to exclude certain characters:
^((?![%=+])[\x00-\x7F])+$
(?![%=+]) is a negative lookahead that asserts that the next character is not one of [%=+] before [\x00-\x7F] matches it.

You could simply exclude those chars from the \x00-\x7f range (using the hex value of each char).
+------+-----+-----+-----+
| Char | Dec | Oct | Hex |
+------+-----+-----+-----+
| %    | 37  | 45  | 25  |
| +    | 43  | 53  | 2B  |
| =    | 61  | 75  | 3D  |
+------+-----+-----+-----+
Regex:
^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$
Engine-wise this is more efficient than attempting an assertion for each character.
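To confirm that the explicit ranges leave out exactly the three characters from the table, a small Python sketch can test every ASCII code point against both patterns. This only checks that the two patterns agree; it does not measure the efficiency claim.

```python
import re

by_range = re.compile(r'^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$')
by_lookahead = re.compile(r'^(?:(?![%=+])[\x00-\x7F])+$')

# Collect every single ASCII character that the range-based pattern rejects.
excluded = [chr(c) for c in range(128) if not by_range.match(chr(c))]
print(excluded)

# Check that both patterns give the same verdict for each ASCII character.
agree = all(
    bool(by_range.match(chr(c))) == bool(by_lookahead.match(chr(c)))
    for c in range(128)
)
print(agree)
```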

Related

Regex that match table input

I have this kind of input
||ID||Part Number||Product Name||Serial Number||Status||Dunning Status||Commitment End||Address||Country||
|1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |
I am looking for two regexes. The first should match only the text inside the header cells of ||ID||Part Number||Product Name||Serial Number||Status||Dunning Status||Commitment End||Address||Country||, taken from the whole table input, and produce no match on the data row |1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |. For the second, I could in theory split by newlines and by |...
I have tried something like [^\|\|]+(?=\|\|). Is this a good solution?
You can't negate a sequence of characters with a negated character class, only individual chars.
I suggest using a regex that will extract any chunks of chars other than | between double ||:
(?<=\|\|)[^|]+(?=\|\|)
Details
(?<=\|\|) - two | chars must be present immediately on the left
[^|]+ - 1+ chars other than |
(?=\|\|) - two | chars must be present immediately on the right.
If you ever need to make sure there are exactly two pipes on each side, and not match if there are three or more, you will need to make the pattern more precise: (?<=(?<!\|)\|\|)[^|]+(?=\|\|(?!\|)).
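A minimal Python sketch of the basic lookbehind/lookahead pattern, run against a shortened, made-up header and data row (the real table has more columns):

```python
import re

# Chunks of non-pipe characters with || immediately on both sides.
HEADER_CELL = re.compile(r'(?<=\|\|)[^|]+(?=\|\|)')

header = "||ID||Part Number||Status||"  # shortened header line
row = "|1|SX0486|Active| | |"           # shortened data row

print(HEADER_CELL.findall(header))  # header cell names
print(HEADER_CELL.findall(row))     # nothing: the row never has two pipes in a row
```

Because the lookarounds consume nothing, the || that closes one cell can also open the next one.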

Protect escaped chars from pattern ending

Assuming you have a pattern "A<(.*?)>"
Using Java, Pattern, Matcher, matcher.find() method as an example.
With the input "A<v1>", the pattern matches and group(1) is "v1".
With the input "A<v1>v2>", the pattern matches and group(1) is "v1", because "?" makes ".*" non-greedy.
Now assume a user wants to protect part of the input like this:
"A<v1\>v2>", so that the pattern matches and group(1) has the value "v1>v2".
So the pattern should stay non-greedy, but an escaped character should be protected and become part of the value (the group).
The matching is done in a "while" loop, because I want to find all occurrences of the pattern in the input. So the pattern should accept as little as possible (non-greedy) but still handle the escaped character (here ">", which is also the end of my pattern).
Any hints?
Thanks in advance.
You can accept \> as a valid expression to match:
A<((\\>|.)*?)>
The group (\\>|.) will match either the characters \> or, if that doesn't match, .. The order is important, because \> will match two characters while . only matches one, meaning that . will gobble up the \ character if it appears first.
To illustrate:
A < v 1 \> v 2 >
| | | | | | | |
A < ( . . \> . . )*? >
However, the resulting match would be v1\>v2, so you'll need to do some processing after the fact to convert \> to >.
If you wanted to go even further and allow escaping the \ character, you could use a character class like so:
A<((\\[>\\]|.)*?)>
Which would match the following:
A<v1\\>
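The question uses Java, but the idea can be sketched in Python as follows. The helper names unescape and extract_all are made up for this illustration, and the post-processing step is one possible way to turn \> and \\ back into > and \.

```python
import re

# Pattern from the answer: inside A<...>, try an escape (\> or \\) first,
# and fall back to any single character otherwise.
PATTERN = re.compile(r'A<((\\[>\\]|.)*?)>')

def unescape(value):
    # Hypothetical post-processing: drop the backslash in front of > or \.
    return re.sub(r'\\([>\\])', r'\1', value)

def extract_all(text):
    # Equivalent of calling matcher.find() in a loop and reading group(1).
    return [unescape(m.group(1)) for m in PATTERN.finditer(text)]

print(extract_all(r'A<v1\>v2> and A<v3>'))
print(extract_all(r'A<v1\\>'))
```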

How to make negative lookbehind in regex work with following meta-sequence? [duplicate]

I'm having trouble understanding negative lookbehind in regular expressions.
For a simple example, say I want to match all Gmail addresses that don't start with 'test'.
I have created an example on regex101 here.
My regular expression is:
(?<!test)\w+?\.?\w+#gmail\.com
So it matches things like:
hagrid#gmail.com
harry.potter#gmail.com
But it also matches things like
test#gmail.com
where the original string was
test#gmail.com
I thought the (?<!test) should exclude that match?
(?<!test)\w+?\.?\w+#gmail\.com works by looking behind each character before moving forward with the match.
test#gmail.com
^
At the point marked by the ^ (before the 0th character), the engine looks behind and doesn't see "test", so it can happily march forward and match "test#gmail.com", which is legal per what remains of the pattern \w+?\.?\w+#gmail\.com.
Using a negative lookahead with a word boundary fixes the problem:
\b(?!test)\w+?\.?\w+#gmail\.com
Consider our target again on the updated regex:
test#gmail.com
^
At this point, the engine is at a word boundary \b, looks ahead and sees "test" and cannot accept the string.
You may wonder if the \b boundary is necessary. It is, because removing it matches "est#gmail.com" from "test#gmail.com".
test#gmail.com
^
The engine's cursor failed to match "test#gmail.com" from the 0th character, but after it steps forward, it matches "est#gmail.com" without problem, but that's not the intent of the programmer.
Demo of rejecting any email otherwise matching your format that begins with "test":
const s = `this is a short example hagrid#gmail.com of what I'm
trying to do with negative lookbehind test#gmail.com
harry.potter#gmail.com testasdf#gmail.com #gmail.com
a#gmail.com asdftest#gmail.com`;
console.log([...s.matchAll(/\b(?!test)\w+?\.?\w+#gmail\.com/g)]);
Note that \w+?\.?\w+ enforces that if there is a period, it must be between \w+ substrings, but this approach rejects a (probably) valid email like "a#gmail.com" because it's only one letter. You might want \b(?!test)(?:\w+?\.?\w+|\w)#gmail\.com to rectify this.
As the name suggests, the (?<! sequence is a negative lookbehind. So, the rest of the pattern would match only if it's not preceded by the look behind. This is determined by where the matching starts from.
Let's start simple - we define a regex .cde. and try to match it against some input:
First nine letters are abcdefgeh
^ ^
| |
.cde. start ------------- |
.cde. end -----------------
So now we can see that the match is bcdef and is preceded by (among other characters) a. So, if we use that as a negative lookbehind (?<!a).cde. we will not get a match:
First nine letters are abcdefgeh
^^ ^
|| |
`(?<!a)` ----------| |
.cde. start ----------- |
.cde. end ----------------
We could match the .cde. pattern but it's preceded by a which we don't want.
However, what happens if we define the negative lookbehind differently - as (?<!b).cde.:
First nine letters are abcdefgeh
^ ^
| |
.cde. start ----------- |
.cde. end ----------------
We get a match for bcdef because there is no b before this match. Therefore, it works fine. Yes, b is the first character of the match but doesn't show up before it. And this is the core of the lookarounds (lookbehinds and lookaheads) - they are not included in the main match. In fact, they are zero-length matches: they are checked but don't appear in the match. In effect, they work from some position but check a part of the input that will not go into the final match.
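The three cases above can be reproduced with a short Python sketch. The helper name found is hypothetical, and Python's re module is used here instead of Regex101.

```python
import re

text = "First nine letters are abcdefgeh"

def found(pattern):
    # Return the matched text, or None when the pattern does not match.
    m = re.search(pattern, text)
    return m.group(0) if m else None

plain = found(r'.cde.')          # no lookbehind at all
not_a = found(r'(?<!a).cde.')    # the match would be preceded by "a"
not_b = found(r'(?<!b).cde.')    # "b" is inside the match, not before it

print(plain, not_a, not_b)
```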
Now, if we return to your regex - (?<!test)\w+?\.?\w+#gmail\.com here is where each match starts:
test#gmail.com
^^ ^
|| |
\w+? -------| |
\w+ -------- |
#gmail\.com -----------
(yes, it's slightly weird, but \w+? and \w+ both produce matches)
The negative lookbehind is for test and since it doesn't appear before the match, the pattern is satisfied.
You might wonder why something like testfoo#gmail.com still produces a match - it has test and then other letters, right?
testfoo#gmail.com
^^ ^
|| |
\w+? -------| |
\w+ -------- |
#gmail\.com --------------
Same result again. The problem is that \w+ will include all letters in a match, so even if the actual string test appears, it will be in the match, not before it.
To be able to differentiate the two, you have to avoid overlaps between the lookbehind pattern and the actual matching pattern.
You can decide to define the matching pattern differently (?<!test)h\w+?\.?\w+#gmail\.com, so the match has to start with an h. In that case there is no overlap and the matching pattern will not "hide" the lookbehind and make it irrelevant. Thus the pattern will match correctly against harry.potter#gmail.com, hagrid#gmail.com but will not match testhermione#gmail.com:
testhermione#gmail.com
^ ^^^ ^
| ||| |
(?<!test) -- ||| |
h ------|| |
\w+? -------| |
\w+ -------- |
#gmail\.com --------------
Alternatively, you can define a lookbehind that doesn't overlap with the start of the matching pattern. But beware. Remember that regexes (like most things with computers) do what you tell them, not exactly what you mean. If we use the regular expression (?<!test-)\w+?\.?\w+#gmail\.com (the negative lookbehind is test- now) and test it against test-hermione#gmail.com, we get a match for ermione#gmail.com:
test-hermione#gmail.com
^ ^^ ^
| || |
(?<!test-) -- || |
\w+? --------| |
\w+ --------- |
#gmail\.com ---------------
The regex says that we don't want anything preceded by test-, so the regex engine obliges - there is a test- before the h, so the engine rejects a match starting at h. A match starting one character later, at e, is not preceded by test-, so the rest of the string fits the pattern.
So, bottom line
avoid having the match overlap with the lookbehind, or it's not actually a lookbehind any more - it's part of the match.
be careful - the regex engine will satisfy the lookbehind but in the most literal way possible with the least effort possible.
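Both pitfalls can be checked with a small Python sketch. The variable names and the found helper are made up for illustration; the patterns are the ones discussed above.

```python
import re

overlapping = re.compile(r'(?<!test)\w+?\.?\w+#gmail\.com')    # original attempt
starts_with_h = re.compile(r'(?<!test)h\w+?\.?\w+#gmail\.com')  # no overlap
shifted = re.compile(r'(?<!test-)\w+?\.?\w+#gmail\.com')       # literal-minded

def found(pattern, text):
    # Return the matched text, or None when there is no match.
    m = pattern.search(text)
    return m.group(0) if m else None

# "test" ends up inside the match, so the lookbehind never sees it.
print(found(overlapping, "testfoo#gmail.com"))
# The match must start at "h", which is preceded by "test": no match.
print(found(starts_with_h, "testhermione#gmail.com"))
print(found(starts_with_h, "harry.potter#gmail.com"))
# The engine simply starts one character later, at "e".
print(found(shifted, "test-hermione#gmail.com"))
```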
In order for this to work properly you need to both:
Use a negative lookahead (as opposed to a lookbehind, like your example)
Anchor the match (to prevent partial matches. Several anchors are possible, but in your case the best is probably \b, for word boundaries)
This is the result:
\b(?!test)\w+?\.?\w+#gmail\.com
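A short Python sketch of why both ingredients are needed, using a made-up input string. The only difference between the two patterns is the \b anchor.

```python
import re

with_boundary = re.compile(r'\b(?!test)\w+?\.?\w+#gmail\.com')
without_boundary = re.compile(r'(?!test)\w+?\.?\w+#gmail\.com')

text = "hagrid#gmail.com test#gmail.com harry.potter#gmail.com"

anchored = with_boundary.findall(text)
unanchored = without_boundary.findall(text)

print(anchored)    # the "test" address is skipped entirely
print(unanchored)  # a partial match slips in one character after "t"
```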

Hive regex: Positive lookahead to match '&' or end of string

I would like to match text between two strings, although the last string/character might not aways be available.
String1: 'www.mywebsite.com/search/keyword=toys'
String2: 'www.mywebsite.com/search/keyword=toys&lnk=hp1'
Here I want to match the value in keyword= that is 'toys' and I am using
(?<=keyword=)(.*)(?=&|$)
It works for String1, but for String2 it also matches everything after '&'.
What am I doing wrong?
.* is greedy. It takes everything it can, therefore stops at the end of the string ($) and not at the & character.
Change it to its non-greedy version - .*?
with t as
(
    select  explode
            (
                array
                (
                    'www.mywebsite.com/search/keyword=toys'
                   ,'www.mywebsite.com/search/keyword=toys&lnk=hp1'
                )
            ) as (val)
)
select  regexp_extract(val,'(?<=keyword=)(.*?)(?=&|$)',0)
from    t
;
+------+
| toys |
+------+
| toys |
+------+
You do not need to bother with greediness when you need to match zero or more occurrences of any characters but a specific character (or set of characters). All you need is to get rid of the lookahead and the dot pattern and use [^&]* (or, if the value you expect should not be an empty string, [^&]+):
(?<=keyword=)[^&]+
Code:
select regexp_extract(val,'(?<=keyword=)[^&]+', 0) from t
Note you do not even need a capturing group since the 0 argument instructs regexp_extract to retrieve the value of the whole match.
Pattern details
(?<=keyword=) - a positive lookbehind that matches a location that is immediately preceded with keyword=
[^&]+ - any 1+ chars other than & (if you use * instead of +, it will match 0 or more occurrences).
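The three variants discussed in this thread can be compared outside Hive with a short Python sketch. Hive uses Java regular expressions, so this is only an approximation of the engine, and the extract helper is a made-up stand-in for regexp_extract(val, pattern, 0).

```python
import re

urls = [
    "www.mywebsite.com/search/keyword=toys",
    "www.mywebsite.com/search/keyword=toys&lnk=hp1",
]

greedy = r'(?<=keyword=)(.*)(?=&|$)'   # pattern from the question
lazy = r'(?<=keyword=)(.*?)(?=&|$)'    # non-greedy fix
negated = r'(?<=keyword=)[^&]+'        # negated character class

def extract(pattern, text):
    # Whole match, or None if nothing matches.
    m = re.search(pattern, text)
    return m.group(0) if m else None

for url in urls:
    print(extract(greedy, url), extract(lazy, url), extract(negated, url))
```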

Can't get a specific regex to work in Perl

I have a string formatted like:
project-version-project_test-type-other_info-other_info.file_type
I can strip most of the information I need out of this string in most cases. My trouble arises when my version has an extra qualifying character in it (i.e. normally 5 characters but sometimes a 6th is added).
Previously, I was using substrings to remove the excess information and get the 'project_test-type' however, now I need to switch to a regex (mostly to handle that extra version character). I could keep using substrings and change the length depending on whether I have that extra version character or not but a regex seems more appropriate here.
I tried using patterns like:
my ($type) = $_ =~ /.*-.*-(.*)-.*/;
But the extra '-' in the 'project_test-type' means I can't simply space my regex using that character.
What regex can I use to get the 'project_test-type' out of my string?
More information:
As a more human readable example, the information is grouped in the following way:
project - version - project_test-type - other_info - other_info . file_type
'project' is a simple string of chars
'version' is normally a string of 5 digits, but is sometimes followed by a letter, i.e. 11111 is normal and 11111A is the rarer occurrence.
'project_test-type' is a specific test associated with a project that can have both '_' and '-' in its otherwise char name
Both cases of 'other_info' are additional bits of information for the system like an IP address or another version number. The first has no fixed length while the second is always 10 characters long
Since no field other than the desired one can contain -, any extra - belongs to the desired field.
      +--------------------------- project
      |     +--------------------- version
      |     |   +----------------- project_test-type
      |     |   |      +---------- other_info
      |     |   |      |     +---- other_info.file_type
      |     |   |      |     |
  ____| ____|   _|  ____| ____|
/^[^-]*-[^-]*-(.*)-[^-]*-[^-]*\z/
[^-] matches a character that's not a -.
[^-]* matches zero or more characters that aren't -.
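As an illustration outside Perl, here is the same pattern in Python, where \Z plays the role of Perl's \z. The file names are invented to follow the layout described in the question.

```python
import re

# Two fields without dashes, the wanted field, then two more dash-free fields.
TYPE_RE = re.compile(r'^[^-]*-[^-]*-(.*)-[^-]*-[^-]*\Z')

# Hypothetical file names: 5-digit version, and version with an extra letter.
names = [
    "proj-11111-proj_test-type-192.168.0.1-0123456789.log",
    "proj-11111A-proj_test-type-192.168.0.1-0123456789.log",
]

types = [TYPE_RE.match(name).group(1) for name in names]
print(types)
```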
To match everything:
/^([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)\.([a-zA-Z0-9]+)$/
[] defines character sets and ^ at the beginning of a set means "NOT". Also a - in a set usually means a range, unless it is at the beginning or end. So [^-]+ consumes as many non-dash characters as possible (at least one).
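A Python sketch of the "match everything" pattern, applied to an invented file name that follows the layout from the question. Every field ends up in its own group.

```python
import re

FIELDS_RE = re.compile(
    r'^([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)\.([a-zA-Z0-9]+)$'
)

# Hypothetical file name; the IP address and 10-character field are made up.
name = "proj-11111A-proj_test-type-192.168.0.1-0123456789.log"

fields = FIELDS_RE.match(name).groups()
print(fields)
```

The greedy (.+) gives characters back until the two trailing dash-free fields and the extension can still match, which is why the extra dash stays inside the third group.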
You can use
/\w+\s*-\s*\d{5}[a-zA-Z]?\s*-\s*(.*?)(?=\s*-\s*\d)/
Explanation:
\w+\s*- ==> match character sequence followed by any number of spaces and a -
\d{5}[a-zA-Z]? ==> exactly 5 digits, optionally followed by one letter
(.*?) => match everything in a non greedy way
(?=\s*-\s*\d) => look forward for a digit and stop (since IP starts with a digit)
Greedy/non-greedy approach
($type) = /.*?-.*?-(.*)-.*-.*/;
.*? is a non-greedy match, meaning match any number of any character, but no more than necessary to match the regular expression. Using .* between the second and third dashes is a greedy match, matching as many characters as possible while still matching the regular expression, and using this will capture words with any extra dashes in them.
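The lookahead pattern and the greedy/non-greedy pattern can be tried side by side in Python. The file name is invented, and the lookahead version relies on the field after the test type starting with a digit, as the explanation above assumes.

```python
import re

lookahead_re = re.compile(
    r'\w+\s*-\s*\d{5}[a-zA-Z]?\s*-\s*(.*?)(?=\s*-\s*\d)'
)
greedy_re = re.compile(r'.*?-.*?-(.*)-.*-.*')

# Hypothetical file name following the layout from the question.
name = "proj-11111A-proj_test-type-192.168.0.1-0123456789.log"

via_lookahead = lookahead_re.search(name).group(1)
via_greedy = greedy_re.search(name).group(1)
print(via_lookahead, via_greedy)
```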