I'm having trouble matching the start and end of a regex in Python.
Essentially I'm confused about when to use word boundaries \b and start/end anchors ^ $
My regex of
^[A-Z]{2}\d{2}
matches 4 characters (two uppercase letters, two digits), which is what I'm after
Matches AJ99, RD22, CP44 etc
However, I also noted that AJAJAJAJAJAJAJAJAJSJHS99 could be matched as well. I've tried using ^ and $ together to match the whole string. This doesn't work
^[A-Z]{2}\d{2}$ # this doesn't work
but
^[A-Z]{2}\d{2} # this is fine
[A-Z]{2}\d{2}$ # this is fine
The string I'm matching against is 4 characters long, but in the first two examples the regex could pick the start and end of a longer string respectively.
s = "NZ43" # 4 characters, match perfect! However....
s = "AM27272727" # matches the first example
s = "HAHSHSHSHDS57" # matches the second example
The position anchors ^ and $ place a restriction on the position of your matched chars:
Analyzing your complete regex:
^[A-Z]{2}\d{2}$
^ matches only at the beginning of the text
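[A-Z]{2} exactly 2 uppercase ASCII alphabetic characters
\d{2} exactly 2 digits (equivalent to [0-9]{2} for ASCII input; in Python 3, \d also matches other Unicode digits unless re.ASCII is set)
$ matches only at the end of the text (in Python, also just before a trailing newline)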
If you remove one or both of the 2 position anchors (^ or $) you can match a substring starting from the beginning or the end as you stated above.
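As a quick check of the anchor behaviour described above, here is a minimal Python sketch that runs the three variants against the strings from the question. The helper name matched is only for illustration.

```python
import re

# Sample strings taken from the question.
samples = ["NZ43", "AM27272727", "HAHSHSHSHDS57"]

start_only = re.compile(r"^[A-Z]{2}\d{2}")   # anchored at the start only
end_only = re.compile(r"[A-Z]{2}\d{2}$")     # anchored at the end only
both = re.compile(r"^[A-Z]{2}\d{2}$")        # anchored at both ends


def matched(pattern, text):
    """Return the matched substring, or None when there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


# For each sample: (start-anchored, end-anchored, fully anchored) result.
results = {
    s: (matched(start_only, s), matched(end_only, s), matched(both, s))
    for s in samples
}
for s, r in results.items():
    print(s, r)
# NZ43 ('NZ43', 'NZ43', 'NZ43')
# AM27272727 ('AM27', None, None)
# HAHSHSHSHDS57 (None, 'DS57', None)
```

Only the fully anchored pattern rejects both longer strings. re.fullmatch(r"[A-Z]{2}\d{2}", s) is another way to require that the whole string matches.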
If you want to match a whole word without relying on the start/end of the string, use the \b anchor, like this:
\b[A-Z]{2}\d{2}\b
\b matches between a word char (in regex a word char \w is one of [a-zA-Z0-9_]) and a non-word char (available as \W), and also at the start/end of the text when the first/last char is a word char.
The regex above matches WS24 in all the next strings:
WS24 alone
before WS24
WS24 after
before WS24 after
NZ43
It doesn't match:
AM27272727 (it will if the string is AM27 272727 or AM27"272727)
HAHSHSHSHDS57 (it will if the string is HAHSHSHSH DS57 or... you get it)
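The lists above can be checked with a short Python sketch. The two list names are just for illustration, and the strings are the ones from this answer.

```python
import re

word = re.compile(r"\b[A-Z]{2}\d{2}\b")

should_match = [
    "WS24 alone",
    "before WS24",
    "WS24 after",
    "before WS24 after",
    "NZ43",
    "AM27 272727",
    'AM27"272727',
]
should_not_match = ["AM27272727", "HAHSHSHSHDS57"]

# findall returns every whole-word code found in each string.
found = {s: word.findall(s) for s in should_match + should_not_match}
for s, hits in found.items():
    print(repr(s), hits)
```

The space and the double quote are non-word chars, so they create the boundary that \b needs after AM27.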
A demo online (the site will be useful to you also to experiment with regex).
Your regexes behave exactly as they are supposed to, so your question suggests that you may not have fully understood how regular expressions work.
As an addition to the very good and informative answer of GsusRecovery, here's a site that guides you through the concepts of regular expressions and teaches the basics with a lesson-based system. To be clear, I do not want to tout this website, as there are plenty of those, but I have made good use of this one, so it's the one I'm suggesting.
I am trying to write a regex that matches any word in a string unless the word is followed by a :, in which case the whole word should be ignored. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the : .
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the letters up to the one that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
or
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
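To see the backtracking effect and the fix side by side, here is a small sketch using Python's re module (the question uses a JavaScript-style literal, so the flags are dropped and findall stands in for /g).

```python
import re

naive = r"[a-zA-Z]+(?!:)"          # pattern from the question
fixed = r"[A-Za-z]+(?![A-Za-z:])"  # lookahead also rejects a following letter

texts = ["lame:joker", "c:real", "real:c"]

naive_hits = {t: re.findall(naive, t) for t in texts}
fixed_hits = {t: re.findall(fixed, t) for t in texts}

print(naive_hits)
# {'lame:joker': ['lam', 'joker'], 'c:real': ['real'], 'real:c': ['rea', 'c']}
print(fixed_hits)
# {'lame:joker': ['joker'], 'c:real': ['real'], 'real:c': ['c']}
```

With the extra letter check in the lookahead, giving back one letter no longer helps the engine, so a word before a colon is skipped as a whole.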
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: the meaning of a word boundary is context dependent (see the number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letters, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12,", but not in "12:", and not in "12c" or "12_" either
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_", and fails only in "12:"
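The difference between the two number patterns can be reproduced with a few lines of Python (a sketch; the sample strings are the ones used in the two lines above).

```python
import re

samples = ["12,", "12:", "12c", "12_"]

# Word boundary version: fails when the digits are glued to a letter or _.
with_boundary = {s: re.findall(r"\d+\b(?!:)", s) for s in samples}
# Lookahead-only version: fails only when a colon follows the digits.
with_lookahead = {s: re.findall(r"\d+(?![\d:])", s) for s in samples}

print(with_boundary)
# {'12,': ['12'], '12:': [], '12c': [], '12_': []}
print(with_lookahead)
# {'12,': ['12'], '12:': [], '12c': ['12'], '12_': ['12']}
```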
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.
I really don't use RegEx that much. You could say I am a RegEx n00b. I have been working on this issue for half a day.
I am trying to write a pattern that looks backward from a number character. For example:
1. bob1 => bob
2. cat3 => cat
3. Mary34 => Mary
So far I have this (?![A-Z][a-z]{1,})([A-Za-z_])
It only matches for individual characters, I want all the characters before the number character. I tried to add the ^ and $ into my pattern and using an online simulator. I am unsure where to put the ^ and $.
NOTE: I am using RegEx for the .NET Framework
You may use a regex like
[\p{L}_]+(?=\d)
or
[\w-[\d]]+(?=\d)
See the regex demo
Pattern details
[\p{L}_]+ - any 1 or more letters (both lower- and uppercase) and/or _
OR
[\w-[\d]]+ - 1 or more word chars except digits (the -[] inside a character class is a character class subtraction construct)
(?=\d) - a positive lookahead that requires a digit to appear immediately to the right of the current location
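For readers who want to try the same idea outside .NET: Python's built-in re module supports neither \p{L} nor character class subtraction, so the sketch below swaps in [^\W\d] (a word char that is not a digit). That substitution is my assumption about the closest equivalent, not part of the .NET answer.

```python
import re

# [^\W\d] means "word char but not a digit", i.e. letters and underscore.
pattern = re.compile(r"[^\W\d]+(?=\d)")

inputs = ["bob1", "cat3", "Mary34"]
prefixes = [pattern.search(s).group(0) for s in inputs]
print(prefixes)
# ['bob', 'cat', 'Mary']
```

The lookahead only checks that a digit follows, so the digit itself never becomes part of the match.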
If we break down your RegEx, we see:
(?![A-Z][a-z]{1,}) which says "at this position, do NOT allow one uppercase letter followed by one or more lowercase letters" and ([A-Za-z_]) which says "match one letter or underscore". This ends up matching single characters only: each letter or underscore that does not start an uppercase-then-lowercase sequence, one character per match.
If I understand what you want to achieve, then you want all of the letters before a number. I would write something like that as:
\b([a-zA-Z]+)[0-9]
This will start at a word boundary \b, match one or more letters into capture group 1, and require a digit right after them. The digit is part of the overall match, so read the letters from group 1.
(The syntax I used seems to match this document about .NET RegEx: https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions)
In light of Wiktor Stribizew's comment, here is a pure match RegEx:
\b[a-zA-Z_]+(?=[0-9])
This matches the pattern and then looks ahead for the digit. This is better than my first lookahead attempt. (Thank you Wiktor.)
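The difference between the capturing version and the lookahead version is easy to see in a quick test. This sketch uses Python's re module rather than .NET; both patterns use only syntax that the two engines share.

```python
import re

text = "Mary34"

# Capturing version: the digit is consumed, the letters sit in group 1.
captured = re.search(r"\b([a-zA-Z]+)[0-9]", text)
# Lookahead version: the digit is only checked, not consumed.
looked = re.search(r"\b[a-zA-Z_]+(?=[0-9])", text)

whole_vs_group = (captured.group(0), captured.group(1))
lookahead_match = looked.group(0)

print(whole_vs_group)   # ('Mary3', 'Mary')
print(lookahead_match)  # Mary
```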
http://www.rexegg.com/regex-lookarounds.html
Using regex, I'm trying to match any string of characters that meets the following conditions (in the order displayed):
Contains a dollar sign $; then
at least one letter [a-zA-Z]; then
zero or more letters, numbers, underscores, periods (dots), opening brackets, and/or closing brackets [a-zA-Z0-9_.\[\]]*; then
one pipe character |; then
one hash sign #; then
at least one letter [a-zA-Z]; then
zero or more letters, numbers, and/or underscores [a-zA-Z0-9_]*; then
zero colons :
In other words, if a colon is found at the end of the string, then it should not count as a match.
Here are some examples of valid matches:
$tmp1|#hello
$x2.h|#hi_th3re
Valid match$here|#in_the middle of other characters
And here are some examples of invalid matches:
$tmp2|#not_a_match:"because there is a colon"
$c.4a|#also_no_match:
Here are some of the patterns I've tried:
(\$[a-zA-Z])([a-zA-Z0-9_.\[\]]*)(\|#)([a-zA-Z][a-zA-Z0-9_]*(?!.[:]))
(\$[a-zA-Z])([a-zA-Z0-9_.\[\]]+)?(\|#)([a-zA-Z][a-zA-Z0-9_]*(?![:]))
(\$[a-zA-Z])([a-zA-Z0-9_.\[\]]+)?(\|#)([a-zA-Z][a-zA-Z0-9_]*)([^:])
This pattern will do what you need
\$[A-Za-z]+[\w.\[\]]*[|]#[A-Za-z]+[\w]*+(?!:)
Regex Demo
I am using possessive quantifiers to cut down the backtracking using [\w]*+. You can also use atomic groups instead of possessive quantifiers like
\$[A-Za-z]+[\w.\[\]]*[|]#[A-Za-z]+(?>[\w]*)(?!:)
NOTE
\w => [A-Za-z0-9_]
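Possessive quantifiers and atomic groups are not available in every engine (Python's re module only gained them in version 3.11). As a sketch of an alternative that needs neither, the lookahead can also reject a following word char, which blocks the same backtracking. The pattern below is my variation on the one above, written for Python, and the helper name first_match is only for illustration.

```python
import re

# re.ASCII keeps \w equal to [A-Za-z0-9_], as in the note above.
pattern = re.compile(r"\$[A-Za-z][\w.\[\]]*\|#[A-Za-z]\w*(?![\w:])", re.ASCII)

valid = [
    "$tmp1|#hello",
    "$x2.h|#hi_th3re",
    "Valid match$here|#in_the middle of other characters",
]
invalid = [
    '$tmp2|#not_a_match:"because there is a colon"',
    "$c.4a|#also_no_match:",
]


def first_match(text):
    """Return the first matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


found = {s: first_match(s) for s in valid + invalid}
for s, hit in found.items():
    print(repr(s), "->", hit)
```

Because the lookahead refuses both a colon and another word char, shortening the \w* part can never turn a failed attempt into a match.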
I tested your third pattern in Regex 101 and it appears to be working correctly:
^.*(\$[a-zA-Z])([a-zA-Z0-9_.\[\]]+)?(\|#)([a-zA-Z][a-zA-Z0-9_]*)([^:]).*$
The only change I needed to make to the regex to make it work was to add anchors ^ and $ to the start and end of the regex. I also allowed for your pattern to occur as a substring in the middle of a larger string.
By the way, you had the following example as a string which should not match:
$tmp2|#not_a_match:"because there is a colon"
However, even if we remove the colon from this string it will still not match because it contains quotes which are not allowed.
Regex101
I have the following regex:
^(?=.{8}$).+
The way I understand this is it will accept 8 of any type of character, followed by 1 or more of any character. I feel I am not grasping how a Positive Lookahead works. Because both sections of the Regex are looking for '.' wouldn't any series of characters fit this?
My question is, how does the positive lookahead affect this regex and what is an example of a matching string?
The following did not match when supplied in the following regex tool:
123456781
(12345678)1
(12345678)
(abcdefgh)a
(abcdefgh)
abc
123
EDIT: Removed first two data entries as I clearly wasn't using the regex tool correctly as they now match with exactly 8 characters.
^(?=.{8}$).+
will match the string
aaaaaaaa
Reasoning:
The content inside of the parentheses is a lookahead, since it starts with ?=.
The content inside of a lookahead is parsed - it is not interpreted literally.
Thus, the lookahead only allows the regex to match if .{8}$ would match (at the start of the string, in this case). So the string has to be exactly eight characters then it has to end, as evidenced by $.
Then .+ will match those eight characters.
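The reasoning above can be verified with a short sketch in Python, whose re module supports lookaheads. The candidate strings come from the question.

```python
import re

pattern = re.compile(r"^(?=.{8}$).+")

candidates = ["12345678", "abcdefgh", "123456781", "(12345678)", "abc"]

# True only for strings that are exactly eight characters long.
matches = {s: bool(pattern.search(s)) for s in candidates}
print(matches)
# {'12345678': True, 'abcdefgh': True, '123456781': False,
#  '(12345678)': False, 'abc': False}
```

The lookahead consumes nothing, so .+ starts again from the beginning of the string and matches the same eight characters that the lookahead has just verified.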
It is trying to match:
^ # start of line, but...
(?=.{8}$) # only if it precedes exactly 8 characters and the end of line
.+ # this one matches those 8 characters
and from your input, it should also match these (try this engine with match at line breaks checked):
12345678
abcdefgh
Matching 12345678 works in ruby:
'12345678' =~ /^(?=.{8}$).+/
=> 0
Maybe your test site doesn't support lookahead in regexps?
There are now different requirements to the regex I am looking for, and it is too complex to solve it on my own.
I need to search for a specific string with the following requirements:
1. String starts with "fu: and ends with "
2. In between those start and end requirements there can be any other string which has the following requirements:
2.1. Less than 50 characters
2.2. Only lower case
2.3. No trailing spaces
2.4. No space between "fu: and the other string.
The result of the regex should be the cases where rule 1 matches but at least one of rules 2.1/2.2/2.3/2.4 doesn't.
At the moment I have following regex: "fu:([^"]*?[A-Z][^"]*?)",
which finds strings which start with "fu: and end with " with any upper case in between like this one:
"fu:this String is wrong cause the s from string is upper case"
I hope it all makes sense, I tried to get into regex but this problem seems too complex for someone who is not working with regex every day.
[Edit]
Apparently I was not clear enough. I want to have matches which are "wrong".
I am looking for the complement of this regex: "fu:(?:[a-z][a-z ]{0,47}[a-z]|[a-z]{0,2})"
some examples:
Match: "fu: this is a match"
Match: "fu:This is a match"
Match: "fu:this is a match "
NO Match: "fu:this is no match"
Sorry, it's not easy to explain :)
Try the following:
"fu:([a-z](?:[a-z ]{0,48}[a-z])?)"
This will match any string that begins with "fu: and ends with a " and the string between those will contain 1-50 characters - only lower-case and not able to begin with a space nor have trailing spaces.
"fu: # begins with "fu:
( # group to match
[a-z] # starts with one lowercase letter
(?: # non-capturing sub-group
[a-z ]{0,48} # matches 0-48 a-z or space characters
[a-z] # sub-group must end with a lowercase letter
)? # sub-group is optional
)
" # ends with "
EDIT: In the event that you need an empty-string to match too, i.e. the full string is "fu:", you can add another ? to the end of the matching-group in the regex:
"fu:([a-z](?:[a-z ]{0,48}[a-z])?)?"
I've kept the two regexes separated (one that allows 1-50 characters in the string and one that allows 0-50) to show the minor difference.
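Here is a quick sketch in Python that exercises the first regex (the 1-50 variant) on the examples from the question plus the length limits. The question does not name a language, so Python and the helper name is_valid are my assumptions.

```python
import re

valid = re.compile(r'"fu:([a-z](?:[a-z ]{0,48}[a-z])?)"')


def is_valid(s):
    """True when the whole string is a well-formed "fu:..." entry."""
    return valid.fullmatch(s) is not None


checks = {
    '"fu:this is no match"': is_valid('"fu:this is no match"'),   # well-formed
    '"fu: this is a match"': is_valid('"fu: this is a match"'),   # leading space
    '"fu:This is a match"': is_valid('"fu:This is a match"'),     # upper case
    '"fu:this is a match "': is_valid('"fu:this is a match "'),   # trailing space
}
fifty_ok = is_valid('"fu:' + "a" * 50 + '"')      # 50 characters still pass
fifty_one_ok = is_valid('"fu:' + "a" * 51 + '"')  # 51 characters do not

print(checks, fifty_ok, fifty_one_ok)
```

Note that the question's "Match"/"NO Match" labels refer to the complement, so the well-formed entry is the one labelled "no match" there.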
EDIT #2: To match the inverse of the above, i.e. - to find all strings that do not match the required format, you can use:
^((?!"fu:([a-z](?:[a-z ]{0,48}[a-z])?)?").)*$
This will explicitly match any line that does not match that pattern. This will consequently also match lines that do not contain "fu: - if that matters.
The only way I can figure out to truly match the opposite of the above and still include the anchors of "fu: and " is to explicitly attempt to match the rules that fail:
"fu:([^a-z].*|[^"]{51,}|[a-z]([^"]*?[A-Z][^"]*?)+|[a-z ]{0,49}[ ])"
This regex will match anything that starts with not a lowercase a-z character, any string that's longer than 50 characters, any string that contains an uppercase letter, or any string that has trailing whitespace. For each additional rule, you'll need to update the regex to match the opposite of what's needed.
My recommendation is, in whatever language you're using, to match all input strings that actually follow your requirements - and if there are no matches then that string must violate your rules.
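A minimal sketch of that recommendation, assuming Python since the question does not name a language: first grab every "fu:..." candidate with a loose pattern, then keep the ones whose body fails the strict format. The names candidate, valid_body and find_violations are hypothetical.

```python
import re

# Loose pattern: anything between "fu: and the next double quote.
candidate = re.compile(r'"fu:([^"]*)"')
# Strict body format: 1-50 chars, lowercase and inner spaces only.
valid_body = re.compile(r"[a-z](?:[a-z ]{0,48}[a-z])?")


def find_violations(text):
    """Return every "fu:..." string in text whose body breaks the rules."""
    return [
        m.group(0)
        for m in candidate.finditer(text)
        if not valid_body.fullmatch(m.group(1))
    ]


sample = "\n".join([
    '"fu: this is a match"',
    '"fu:This is a match"',
    '"fu:this is a match "',
    '"fu:this is no match"',
])
violations = find_violations(sample)
print(violations)
# ['"fu: this is a match"', '"fu:This is a match"', '"fu:this is a match "']
```

With this split, a new rule only changes valid_body; the loose candidate pattern stays the same. An empty body ("fu:") is reported as a violation here, in line with the 1-50 variant of the regex.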
"fu:([^A-Z" ](?:[^A-Z"]{0,48}[^A-Z" ])?)"
The above regex should match the specified requirements.
That's probably what you need
"fu:([a-z](?:[a-z ]{,48}[a-z])?)"
Try this:
"fu:(?:[a-z][a-z ]{0,47}[a-z]|[a-z]?)"