I'm trying to write a syntax rule for a Vim plugin I'm writing, and I'm having trouble writing a Vim regex that will match a # symbol followed by an identifier, which is defined as at least two letters followed by any number of accepted characters. Here's what I have so far:
syntax match aldaAtMarker "\v#[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"
I know that everything after the # works (at least, as far as I can tell) because I copy-pasted it from an aldaIdentifier rule that appears to work correctly. But I'm having trouble prepending the literal # symbol, because the Vim regex system evidently ascribes a special meaning to # (see :help syntax and grep for #).
With my syntax rule as written above, trying to load the plugin results in the following errors:
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E866: (NFA regexp) Misplaced #
Press ENTER or type command to continue
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E64: # follows nothing
Press ENTER or type command to continue
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E475: Invalid argument: aldaAtMarker "\v#[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"
Press ENTER or type command to continue
If I replace # with \#, there are no errors, but the wrong things are highlighted, which makes me think that the \# in my regex is being interpreted in a special way instead of being taken for a literal # character.
I'm clearly missing something and my Google-fu is failing me. How do I include a literal # symbol in a Vim regex in "very magic" (\v) mode?
From the Vim help (:help /magic):
The recommended setting is \m (magic), which is the default.
Otherwise, a literal # can always be matched with the character set [#].
3. Magic */magic*
Some characters in the pattern are taken literally. They match with the same
character in the text. When preceded with a backslash however, these
characters get a special meaning.
Other characters have a special meaning without a backslash. They need to be
preceded with a backslash to match literally.
If a character is taken literally or not depends on the 'magic' option and the
items mentioned next.
*/\m* */\M*
Use of "\m" makes the pattern after it be interpreted as if 'magic' is set,
ignoring the actual value of the 'magic' option.
Use of "\M" makes the pattern after it be interpreted as if 'nomagic' is used.
*/\v* */\V*
Use of "\v" means that in the pattern after it all ASCII characters except
'0'-'9', 'a'-'z', 'A'-'Z' and '_' have a special meaning. "very magic"
Use of "\V" means that in the pattern after it only the backslash has a
special meaning. "very nomagic"
Examples:
after:    \v       \m       \M       \V       matches
                   'magic'  'nomagic'
          $        $        $        \$       matches end-of-line
          .        .        \.       \.       matches any character
          *        *        \*       \*       any number of the previous atom
          ()       \(\)     \(\)     \(\)     grouping into an atom
          |        \|       \|       \|       separating alternatives
          \a       \a       \a       \a       alphabetic character
          \\       \\       \\       \\       literal backslash
          \.       \.       .        .        literal dot
          \{       {        {        {        literal '{'
          a        a        a        a        literal 'a'
{only Vim supports \m, \M, \v and \V}
It is recommended to always keep the 'magic' option at the default setting,
which is 'magic'. This avoids portability problems. To make a pattern immune
to the 'magic' option being set or not, put "\m" or "\M" at the start of the
pattern.
It turns out that I had another syntax rule that was highlighting some additional things in the same color and throwing me off.
In very magic mode, \# does appear to correctly escape the # symbol:
syntax match aldaAtMarker "\v\#[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"
I have code that works, but I need to make only the last two letters red:
$text = '£5,485.00';
$text = preg_replace('/(\b[a-z])/i','<span style="color:red;">\1</span>',$text);
echo $text;
The result should look like this: [image]
Taking your question verbatim:
preg_replace('/\w{2}$/','<span style="color:red;">\0</span>', $text);
               ^^^^^^                             ^^
               \w{2} : two word characters        \0 : main matching group
               $     : anchored at the end
You may want to support Unicode (the u modifier) and prevent $ from also matching before a newline at the end of the string (the D modifier):
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid.
D (PCRE_DOLLAR_ENDONLY)
If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.
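For comparison, here is a minimal sketch of the same idea in Python rather than PHP (the language switch and the helper name wrap_last_two are assumptions made only for illustration). Python's re module has no D modifier; \Z acts as an end-only anchor, and \g<0> refers to the whole match much as \0 does in preg_replace:

```python
import re

# \g<0> is the whole match, the counterpart of \0 in the PHP replacement
SPAN = r'<span style="color:red;">\g<0></span>'


def wrap_last_two(text):
    # \w{2} : two word characters, \Z : only at the very end of the string
    return re.sub(r'\w{2}\Z', SPAN, text)


print(wrap_last_two('£5,485.00'))

# `$` also matches just before a trailing newline; `\Z` does not
print(bool(re.search(r'\w{2}$', '485.00\n')))
print(bool(re.search(r'\w{2}\Z', '485.00\n')))
```

In Python 3, \w is Unicode-aware for str patterns by default, so nothing like the u modifier is needed here.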
Why does this return ['ABC']?
import re
s='''ABC'''
# use findall to return the parts we want
print(re.findall(r'ABC\Z', s))
While this returns nothing?
s='''ABC'''
# use findall to return the parts we want
print(re.findall(r'ABC[\Z]', s))
Root cause
When an anchor or word boundary is placed inside a character class, it loses its special meaning. According to the re documentation:
[]
Used to indicate a set of characters.
and
\b
... Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
\Z behaves the same way as \b: inside a character class, the anchor meaning is lost. Note that r'[\Z]' does not produce any warning in Python versions before 3.6 and matches a single Z, because \Z inside a set is an unknown escape for Python re:
Unknown escapes such as \j are left alone.
Starting with Python 3.6, you cannot use a \ with an ASCII letter after it that is an unknown escape (see reference):
Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII letter now are errors.
So, r'[\Z]' in Python up to 3.5 will work as follows:
import re
print(re.findall(r'[\Z]', '\\Z')) # => ['Z']
Solution
To match either a (string of) letter(s) or a zero-width assertion, use a grouping construct, capturing (...) or non-capturing (?:...), with an alternation operator |:
(?:\n|\Z)
This will match either a newline symbol, or the very end of string (in Python, \Z matches the same position in string as \z in PCRE/Perl/.NET).
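The difference between the anchor, the alternation group and the character class can be checked with a short sketch (the sample strings are made up for illustration; the last part assumes Python 3.6 or later, where the unknown escape is an error):

```python
import re

s = 'ABC'

# Outside a set, \Z is an anchor: it matches only at the very end of the string.
print(re.findall(r'ABC\Z', s))

# Alternation keeps the anchor meaning: a newline OR the end of the string.
end = re.compile(r'ABC(?:\n|\Z)')
print(end.findall('ABC'))
print(end.findall('ABC\nDEF'))

# Inside a set the anchor meaning is lost; on Python 3.6+ compiling fails.
try:
    re.compile(r'ABC[\Z]')
    compiled = True
except re.error as exc:
    compiled = False
    print('re.error:', exc)
```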
s='''ABC'''
# use findall to return the parts we want
print(re.findall(r'ABC[\Z]', s))
This code raises an error (in Python 3.6 and later):
error: bad escape \Z at position 4
There are some rules about character classes:
Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
What happens inside the character class is that the regex engine tries to interpret \Z as an escape sequence. Since \Z is not a valid escape inside a set, the engine raises an error.
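The quoted rule about sets is easy to verify with a small sketch (the sample string 'a+(b*c)' is an arbitrary choice):

```python
import re

# Inside [...] the characters ( + * ) are plain literals, not operators.
literal_set = re.compile(r'[(+*)]')
found = literal_set.findall('a+(b*c)')
print(found)

# Outside a set, the same characters must be escaped to be matched literally.
escaped = re.findall(r'\(|\+|\*|\)', 'a+(b*c)')
print(escaped)
```

Both calls find the same four characters, in the order they appear in the string.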
What do these expressions mean? Where can I learn about their usage?
\\d
\\D
\\s
\\S
\\w
\\W
\\t
\\n
^
$
\
| etc..
I need to use the stringr package and I have absolutely no idea how to use these.
From ?regexp, in the Extended Regular Expressions section:
The caret ‘^’ and the dollar sign ‘$’ are metacharacters that
respectively match the empty string at the beginning and end of a
line. The symbols ‘\<’ and ‘\>’ match the empty string at the
beginning and end of a word. The symbol ‘\b’ matches the empty
string at either edge of a word, and ‘\B’ matches the empty string
provided it is not at an edge of a word. (The interpretation of
‘word’ depends on the locale and implementation: these are all
extensions.)
From Perl-like Regular Expressions:
The escape sequences ‘\d’, ‘\s’ and ‘\w’ represent any decimal
digit, space character and ‘word’ character (letter, digit or
underscore in the current locale: in UTF-8 mode only ASCII letters
and digits are considered) respectively, and their upper-case
versions represent their negation. Vertical tab was not regarded
as a space character in a ‘C’ locale before PCRE 8.34 (included in
R 3.0.3). Sequences ‘\h’, ‘\v’, ‘\H’ and ‘\V’ match horizontal
and vertical space or the negation. (In UTF-8 mode, these do
match non-ASCII Unicode code points.)
Note that backslashes usually need to be doubled/protected in R input, e.g. you would use "\\h" to match horizontal space.
From ?Quotes:
Backslash is used to start an escape sequence inside character
constants. Escaping a character not in the following table is an
error.
\n newline
\r carriage return
\t tab
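The same escape classes exist in most regex flavours. As a rough illustration in Python rather than R (an assumption made only so the example is easy to run; the sample text is made up), note how a doubled backslash in an ordinary string literal is the same pattern as a raw string:

```python
import re

# In an ordinary string literal the backslash must be doubled, as in R;
# a raw string (r"...") lets you write the regex as-is.
assert "\\d" == r"\d"

text = "a12 b345\tc"

digits = re.findall(r"\d+", text)              # runs of decimal digits
non_space = re.findall(r"\S+", text)           # runs of non-whitespace
joined = re.sub(r"\s", "_", text)              # every whitespace char -> "_"
starts_with_a = bool(re.search(r"^a", text))   # ^ anchors at the start
ends_with_c = bool(re.search(r"c$", text))     # $ anchors at the end

print(digits, non_space, joined, starts_with_a, ends_with_c)
```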
As others comment above, you may need a little more help if you're getting started with regular expressions for the first time. This is a little bit off-topic for StackOverflow (links to off-site resources), but there are some links to regular expression resources at the bottom of the gsubfn package overview. Or Google "regular expression tutorial" ...
The RegEx:
^([0-9\.]+)\Q|\E([^\Q|\E])\Q|\E
does not match the string:
1203730263.912|12.66.18.0|
Why?
From PHP docs,
\Q and \E can be used to ignore regexp metacharacters in the pattern.
For example:
\w+\Q.$.\E$ will match one or more word characters, followed by literals .$. and anchored at the end of the string.
And your regex should be,
^([0-9\.]+)\Q|\E([^\Q|\E]*)\Q|\E
OR
^([0-9\.]+)\Q|\E([^\Q|\E]+)\Q|\E
You forgot to add a quantifier (* or +) after [^\Q|\E]. Without one, it matches only a single character.
Explanation:
^            Start of the string.
([0-9\.]+)   Captures digits or dots, one or more times.
\Q|\E        In PCRE, \Q starts a literal sequence and \E ends it; any character inside that block is treated literally. So the | in that block tells the regex engine to match a literal |.
([^\Q|\E]+)  Captures any character other than |, one or more times.
\Q|\E        Matches a literal pipe symbol.
The accepted answer seems somewhat incorrect so I wanted to address this for future readers.
If you did not already know, using \Q and \E ensures that any character between \Q ... \E will be matched literally, not interpreted as a metacharacter by the regular expression engine.
First and most important, \Q and \E are NOT usable within a bracketed character class [].
[^\Q|\E] # Incorrect
[^|] # Correct
Secondly, you do not follow that class with a quantifier. Using this, the correct syntax would be:
^([0-9.]+)\Q|\E([^|]+)\Q|\E
Although, it is much simpler to write this out as:
^([0-9.]+)\|([^|]+)\|
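A quick way to check the simplified pattern against the sample line is a short Python sketch (Python is an assumption here, since the question is about PCRE; Python's re has no \Q...\E, so re.escape() is used as the closest standard equivalent):

```python
import re

line = '1203730263.912|12.66.18.0|'

# Original pattern shape: the negated class has no quantifier, so it can
# only match ONE character between the two pipes.
no_quantifier = re.compile(r'^([0-9.]+)\|([^|])\|')
print(no_quantifier.match(line))

# With "+" the second group can span the whole field.
fixed = re.compile(r'^([0-9.]+)\|([^|]+)\|')
print(fixed.match(line).groups())

# Building the pattern from an escaped literal instead of \Q|\E.
pipe = re.escape('|')
built = re.compile('^([0-9.]+)' + pipe + '([^|]+)' + pipe)
print(built.match(line).groups())
```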
I can only find negative lookbehind for this, something like (?<!\\).
But this won't compile in C++ and flex. It seems that neither regex.h nor flex supports this?
I am trying to implement a shell that has to treat special characters like >, < or | as a normal argument string if they are preceded by a backslash. In other words, only treat a special character as special if it is preceded by zero or an even number of '\'.
So echo \\>a or echo abc>a should direct output to a
but echo \>a should print >a
What regular expression should I use?
I'm using flex and yacc to parse the input.
In a Flex rule file, you'd use \\ to match a single backslash '\' character. This is because the \ is used as an escape character in Flex.
BACKSLASH           \\
LITERAL_BACKSLASH   \\\\
LITERAL_LESSTHAN    \\<
LITERAL_GREATERTHAN \\>
LITERAL_VERTICALBAR \\\|
If I follow you correctly, in your case you want "\>" to be treated as literal '>' but "\\>" to be treated as literal '\' followed by special redirect. You don't need negative look behind or anything particularly special to accomplish this as you can build one rule that would accept both your regular argument characters and also the literal versions of your special characters.
For purposes of discussion, let's assume that your argument/parameter can contain any character but ' ', '\t', and the special forms of '>', '<', '|'. The rule for the argument would then be something like:
ARGUMENT ([^ \t\\><|]|\\\\|\\>|\\<|\\\|)+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\\\ matches any instance of "\\" (i.e. a literal backslash)
\\> matches any instance of "\>" (i.e. a literal greater than)
\\< matches any instance of "\<" (i.e. a literal less than)
\\\| matches any instance of "\|" (i.e. a literal vertical bar/pipe)
Actually... You can probably just shorten that rule to:
ARGUMENT ([^ \t\\><|]|\\[^ \t\r\n])+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\[^ \t\r\n] matches any character preceded by a '\' in your input except for whitespace (which will handle all of your special characters and allow for literal forms of all other characters)
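To see how the shortened rule splits the example commands, here is a rough Python transcription of it (an assumption for quick experimentation only; flex scans differently, and the name ARGUMENT is simply reused from the rule above):

```python
import re

# Python spelling of the shortened rule:
#   [^ \t\\><|]    any char except space, tab, backslash and the operators
#   \\[^ \t\r\n]   a backslash followed by any non-whitespace char
ARGUMENT = re.compile(r'(?:[^ \t\\><|]|\\[^ \t\r\n])+')

for cmd in (r'echo \>a', r'echo \\>a', 'echo abc>a'):
    print(cmd, '->', ARGUMENT.findall(cmd))
```

In the first command the escaped > stays inside the argument; in the other two it is left unmatched, so a separate rule can pick it up as the redirect operator.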
If you want to allow for literal whitespace in your arguments/parameters then you could shorten the rule even further, but be careful with using \\. for the second half of the rule alternation, as it may or may not match a "\" followed by "\n" (i.e. eat your trailing command terminator character!).
Hope that helps!
You cannot easily extract single escaped characters from a command-line, since you will not know the context of the character. In the simplest case, consider the following:
LessThan:\<
BackslashFrom:\\<
In the first one, < is an escaped character; in the second one, it is not. If your language includes quotes (as most shells do), things become even more complicated. It's a lot better to parse the string left to right, one entity at a time. (I'd use flex myself, because I've stopped wasting my time writing and testing lexers, but you might have some pedagogical reason to do so.)
If you really need to find a special character which shouldn't be special, just search for it (in C++98, where you don't have raw literals, you'll have to escape all of the backslashes):
regex: (\\\\)*\\[<>|]
(An even number -- possibly 0 -- of \, then a \ and a <, > or |)
as a C string => "(\\\\\\\\)*\\\\[<>|]"
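The left-to-right approach recommended above can be sketched in a few lines. This is only an illustration in Python, not a flex lexer; the function name tokenize and the ("word", ...) / ("op", ...) token shapes are hypothetical, and quoting is deliberately ignored:

```python
def tokenize(line):
    """Split a command line into ("word", text) and ("op", char) tokens."""
    tokens = []
    word = []  # characters of the word currently being built
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # Escaped character: take the next one literally, whatever it is.
            word.append(line[i + 1])
            i += 2
            continue
        if ch in "<>| \t":
            # An unescaped operator or blank ends the current word.
            if word:
                tokens.append(("word", "".join(word)))
                word = []
            if ch in "<>|":
                tokens.append(("op", ch))
        else:
            word.append(ch)
        i += 1
    if word:
        tokens.append(("word", "".join(word)))
    return tokens


for cmd in (r"echo \>a", r"echo \\>a", "echo abc>a"):
    print(cmd, "->", tokenize(cmd))
```

Because each backslash consumes the character after it as soon as it is seen, the scanner never has to count how many backslashes precede an operator.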