Why does this return ['ABC']
import re
s='''ABC'''
# use findall to return the parts we want
print(re.findall(r'ABC\Z', s))
while this returns nothing?
s='''ABC'''
# use findall to return the parts we want
print(re.findall(r'ABC[\Z]', s))
Root cause
When an anchor or a word boundary is placed inside a character class, it loses its special meaning. According to the re documentation:
[]
Used to indicate a set of characters.
and
\b
... Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
\Z behaves the same way as \b: inside a character class, the anchor meaning is lost. Note that r'[\Z]' does not raise an error in Python versions before 3.6 and matches a single Z, because \Z there is an unknown escape for Python re:
Unknown escapes such as \j are left alone.
Starting with Python 3.6, a \ followed by an ASCII letter that does not form a known escape is an error (see reference):
Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII letter now are errors.
So, r'[\Z]' in Python up to 3.5 will work as follows:
import re
print(re.findall(r'[\Z]', '\\Z')) # => ['Z']
Solution
To match either a (string of) letter(s) or a zero-width assertion, use a grouping construct, capturing (...) or non-capturing (?:...), with an alternation operator |:
(?:\n|\Z)
This will match either a newline symbol, or the very end of string (in Python, \Z matches the same position in string as \z in PCRE/Perl/.NET).
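As a minimal sketch of the alternation above, the following Python snippet puts (?:\n|\Z) after the ABC literal from the question. The name ABC_AT_LINE_OR_STRING_END and the sample strings are made up for illustration.

```python
import re

# Hypothetical name: ABC followed by a newline or by the very end of the string.
ABC_AT_LINE_OR_STRING_END = re.compile(r'ABC(?:\n|\Z)')

# Illustrative inputs, not taken from the question.
samples = ['ABC', 'ABC\nABC', 'ABCD']
results = [ABC_AT_LINE_OR_STRING_END.findall(text) for text in samples]

for text, found in zip(samples, results):
    print(repr(text), '->', found)
```

Because the group is non-capturing, findall returns the whole match, including the newline when that branch is the one that matched.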
s='''ABC'''
# use findall to return the parts we want
print(re.findall(r'ABC[\Z]', s))
error: bad escape \Z at position 4
In Python 3.6 and later, this code raises an error.
The re documentation has a rule about character classes:
Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
Inside the character class, \Z is no longer an anchor, so the regex engine tries to treat it as an escaped letter Z. Since \Z is not a valid escape there, the engine raises an error.
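Both behaviours can be checked with a short sketch, assuming Python 3.6 or later (earlier versions do not reject the unknown escape). The helper name compile_error is hypothetical.

```python
import re

def compile_error(pattern):
    """Return the error message if the pattern fails to compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Metacharacters are plain literals inside a set.
set_matches = re.findall(r'[(+*)]', 'a(b+c*)')

# \Z inside a set is an unknown escape (assumes Python 3.6+).
bad_escape_message = compile_error(r'ABC[\Z]')

print(set_matches)
print(bad_escape_message)
```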
Related
I have code that works, but I need to make only the last two characters red:
$text = '£5,485.00';
$text = preg_replace('/(\b[a-z])/i','<span style="color:red;">\1</span>',$text);
echo $text;
Taking your question verbatim:
preg_replace('/\w{2}$/','<span style="color:red;">\0</span>', $text);
\w{2} : two word characters
$     : anchored at the end
\0    : the whole match
You may want to support Unicode (the /u modifier) and prevent $ from matching before a newline at the end of the string, so that it matches only at the very end (the /D modifier):
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid.
D (PCRE_DOLLAR_ENDONLY)
If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.
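For comparison only, here is a rough Python analogue of the same idea (not PHP, and not the answer's own code). It assumes that \Z in Python's re plays a role similar to $ with the D modifier, and that Python 3 str patterns are already Unicode-aware, so no counterpart of /u is needed.

```python
import re

text = '£5,485.00'

# \Z matches only at the very end of the string; \g<0> is the whole match.
highlighted = re.sub(r'\w{2}\Z', r'<span style="color:red;">\g<0></span>', text)
print(highlighted)

# Without re.MULTILINE, $ also matches just before a trailing newline; \Z does not.
dollar_hit = re.search(r'\w{2}$', 'ab\n')
strict_hit = re.search(r'\w{2}\Z', 'ab\n')
print(dollar_hit is not None, strict_hit is not None)
```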
I'm reading the Python documentation for the re library and am quite confused by the following paragraph:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
How is \\\\ evaluated?
\\\\ -> \\\ -> \\ cascadingly
or \\\\ -> \\ in pairs?
I know \ is a metacharacter just like |, so I can do
>>> re.split('\|', 'a|b|c|d') # split by literal '|'
['a', 'b', 'c', 'd']
but
>>> re.split('\\', 'a\b\c\d') # split by literal '\'
Traceback (most recent call last):
gives me an error. It seems that, unlike \|, the \\ is evaluated more than once.
and I tried
>>> re.split('\\\\', 'a\b\c\d')
['a\x08', 'c', 'd']
which makes me even more confused...
There are two things going on here - how strings are evaluated, and how regexes are evaluated.
'a\b\c\d' in Python code represents the string a<backspace>\c\d
'\\\\' in python code represents the string \\.
the string \\ is a regex pattern that matches the character \
Your problem here is that the string you're searching is not what you expect.
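\b is the backspace character, \x08. \c and \d are not recognized escape sequences, so Python keeps the backslash as-is. Since Python 3.6 such sequences trigger a DeprecationWarning (a SyntaxWarning since 3.12), and they are slated to become an error in a future version.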
I assume you meant to spell it r'a\b\c\d' or 'a\\b\\c\\d'
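The points above can be checked with a small sketch. To avoid the invalid-escape warning, the string that 'a\b\c\d' denotes is spelled out explicitly here; the variable names are illustrative.

```python
import re

pattern = '\\\\'            # two characters: the regex \\ (one literal backslash)
as_typed = 'a\x08\\c\\d'    # what 'a\b\c\d' denotes: a, backspace, \c, \d
intended = r'a\b\c\d'       # raw string: three real backslashes

print(len(pattern))
print(re.split(pattern, as_typed))
print(re.split(pattern, intended))
```

The first split reproduces the ['a\x08', 'c', 'd'] result from the question, because the subject contains only two backslashes.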
re.split('\\', 'a\b\c\d') # split by literal '\'
You forgot that '\' is also an escape character in the second argument. It would work if the second argument were changed:
re.split(r'\\', 'a\\b\\c\\d')
This r at the start means "raw" string - escape characters are not evaluated.
Think about the implications of evaluating backslashes cascadingly:
If you wanted the string \n (not the newline symbol, but literally \n), you couldn't find a sequence of characters to get said string.
\n would be the newline symbol, \\n would be evaluated to \n, which in turn would become the newline symbol again. This is why escape sequences are evaluated in pairs.
So you need to write \\ within a string to get a single \, but you need to have two backslashes in your string so that the regex will match the literal \. Therefore you will need to write \\\\ to match a literal backslash.
You have a similar problem with your a\b\c\d string. The parser will try to evaluate the escape sequences, and \b is a valid sequence for 'backspace', represented as \x08. You will need to escape your backslashes here, too, like a\\b\\c\\d.
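A short sketch of the two layers, assuming nothing beyond the standard re module; the variable names are illustrative:

```python
import re

newline = '\n'        # one character: the newline itself
backslash_n = '\\n'   # two characters: a backslash, then the letter n

# Layer 1 (string literal): '\\\\' collapses pairwise into two backslashes.
# Layer 2 (regex): those two backslashes mean one literal backslash.
pattern = '\\\\'
same_pattern_raw = r'\\'

parts = re.split(pattern, 'a\\b\\c\\d')
print(len(newline), len(backslash_n), pattern == same_pattern_raw, parts)
```

A raw string skips the first layer, which is why r'\\' and '\\\\' are the same two-character pattern.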
I'm trying to write a syntax rule for a Vim plugin I'm writing, and I'm having trouble writing a Vim regex that will match an # symbol followed by an identifier, which is defined as two letters followed by any number of accepted characters. Here's what I have so far:
syntax match aldaAtMarker "\v#[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"
I know that everything after the # works (at least, as far as I can tell) because I copy-pasted it from an aldaIdentifier rule that appears to work correctly. But I'm having trouble prepending the literal # symbol because the Vim regex system evidently ascribes a special meaning to # (see :help syntax and grep for #).
With my syntax rule as written above, trying to load the plugin results in the following errors:
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E866: (NFA regexp) Misplaced #
Press ENTER or type command to continue
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E64: # follows nothing
Press ENTER or type command to continue
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E475: Invalid argument: aldaAtMarker "\v#[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"
Press ENTER or type command to continue
If I replace # with \#, there are no errors, but the wrong things are highlighted, which makes me think that the \# in my regex is being interpreted in a special way instead of being taken for a literal # character.
I'm clearly missing something and my Google-fu is failing me. How do I include a literal # symbol in a Vim regex in "very magic" (\v) mode?
From the Vim help (:help /magic):
The recommended mode is \m ("magic"), which is the default setting.
Otherwise, a literal # can always be matched with the character class [#].
3. Magic */magic*
Some characters in the pattern are taken literally. They match with the same
character in the text. When preceded with a backslash however, these
characters get a special meaning.
Other characters have a special meaning without a backslash. They need to be
preceded with a backslash to match literally.
If a character is taken literally or not depends on the 'magic' option and the
items mentioned next.
*/\m* */\M*
Use of "\m" makes the pattern after it be interpreted as if 'magic' is set,
ignoring the actual value of the 'magic' option.
Use of "\M" makes the pattern after it be interpreted as if 'nomagic' is used.
*/\v* */\V*
Use of "\v" means that in the pattern after it all ASCII characters except
'0'-'9', 'a'-'z', 'A'-'Z' and '_' have a special meaning. "very magic"
Use of "\V" means that in the pattern after it only the backslash has a
special meaning. "very nomagic"
Examples:
after:    \v       \m       \M       \V       matches
                   'magic'  'nomagic'
          $        $        $        \$       matches end-of-line
          .        .        \.       \.       matches any character
          *        *        \*       \*       any number of the previous atom
          ()       \(\)     \(\)     \(\)     grouping into an atom
          |        \|       \|       \|       separating alternatives
          \a       \a       \a       \a       alphabetic character
          \\       \\       \\       \\       literal backslash
          \.       \.       .        .        literal dot
          \{       {        {        {        literal '{'
          a        a        a        a        literal 'a'
{only Vim supports \m, \M, \v and \V}
It is recommended to always keep the 'magic' option at the default setting,
which is 'magic'. This avoids portability problems. To make a pattern immune
to the 'magic' option being set or not, put "\m" or "\M" at the start of the
pattern.
It turns out that I had another syntax rule that was highlighting some additional things in the same color and throwing me off.
In very magic mode, \# does appear to correctly escape the # symbol:
syntax match aldaAtMarker "\v\#[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"
I'm trying to use regular expressions to split an EDIFACT line. In EDIFACT, the components of a line are separated by a token, usually "+". The "+" can be escaped by preceding it with a "?". I can achieve this much using the expression
(?<!\?)\+
So far so good. However, the escape character itself can be escaped, by doubling it up ("??"). Here are some examples and the output when split
ABC+DEF+GHI => ABC, DEF and GHI (3 elements)
ABC?+DEF+GHI => ABC?+DEF and GHI (2 elements)
ABC??+DEF+GHI => ABC??, DEF and GHI (3 elements)
It's the third one I'm struggling with. How would I tweak the expression I'm using to behave as required?
Strings that can have escaped entities cannot be split with lookaround-based regexps. Instead, matching is a more reliable approach: match all substrings that are not escaped sequences and not the delimiter, and then those that are.
(?:[^?+]|\?.)+
The (?:[^?+]|\?.)+ pattern matches one or more occurrences of either a character other than ? and +, or a literal ? followed by any character (any character except a newline, unless the DOTALL modifier is set).
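As a minimal sketch, the pattern can be applied in Python to the three example lines from the question. The helper name split_components is hypothetical.

```python
import re

# The pattern from the answer: runs of non-special characters or ?-escaped pairs.
SEGMENT = re.compile(r'(?:[^?+]|\?.)+')

def split_components(line):
    """Return the components of a line by matching them, not by splitting."""
    return SEGMENT.findall(line)

for line in ['ABC+DEF+GHI', 'ABC?+DEF+GHI', 'ABC??+DEF+GHI']:
    print(line, '=>', split_components(line))
```

One caveat of this sketch: because the pattern needs at least one character, an empty component between two consecutive + separators is skipped and not returned as an empty string.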
The RegEx:
^([0-9\.]+)\Q|\E([^\Q|\E])\Q|\E
does not match the string:
1203730263.912|12.66.18.0|
Why?
From PHP docs,
\Q and \E can be used to ignore regexp metacharacters in the pattern.
For example:
\w+\Q.$.\E$ will match one or more word characters, followed by literals .$. and anchored at the end of the string.
And your regex should be,
^([0-9\.]+)\Q|\E([^\Q|\E]*)\Q|\E
OR
^([0-9\.]+)\Q|\E([^\Q|\E]+)\Q|\E
You forgot to add a quantifier (+ or *) after [^\Q|\E]. Without it, the class matches a single character.
Explanation:
^ Starting point.
([0-9\.]+) Captures digits or dot one or more times.
\Q|\E In PCRE, \Q begins and \E ends a quoted (literal) sequence: any character inside that block is treated literally. So the | symbol in that block tells the regex engine to match a literal |.
([^\Q|\E]+) Captures any character not of | one or more times.
\Q|\E Matches a literal pipe symbol.
The accepted answer seems somewhat incorrect so I wanted to address this for future readers.
If you did not already know, using \Q and \E ensures that any character between \Q ... \E will be matched literally, not interpreted as a metacharacter by the regular expression engine.
First and most important, \Q and \E are NOT usable within a bracketed character class [].
[^\Q|\E] # Incorrect
[^|] # Correct
Secondly, you do not follow that class with a quantifier. Using this, the correct syntax would be:
^([0-9.]+)\Q|\E([^|]+)\Q|\E
Although, it is much simpler to write this out as:
^([0-9.]+)\|([^|]+)\|
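The effect of the missing quantifier can be reproduced in Python. This is a sketch under one assumption worth stating: Python's re module has no \Q...\E, so re.escape stands in for it here. The variable names are illustrative.

```python
import re

line = '1203730263.912|12.66.18.0|'
sep = re.escape('|')  # stand-in for \Q|\E; yields an escaped pipe

# With a quantifier on the negated class, as in the corrected pattern.
fixed = re.compile(r'^([0-9.]+)' + sep + r'([^|]+)' + sep)
# Without the quantifier, as in the question: exactly one character between pipes.
original = re.compile(r'^([0-9.]+)' + sep + r'([^|])' + sep)

fixed_match = fixed.match(line)
original_match = original.match(line)
print(fixed_match.groups() if fixed_match else None)
print(original_match)
```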