Meaning of regular expressions like - \\d , \\D, ^ , $ etc [duplicate] - regex

What do these expressions mean? Where can I learn about their usage?
\\d
\\D
\\s
\\S
\\w
\\W
\\t
\\n
^
$
\
| etc..
I need to use the stringr package and I have absolutely no idea how to use these.

From ?regexp, in the Extended Regular Expressions section:
The caret ‘^’ and the dollar sign ‘$’ are metacharacters that
respectively match the empty string at the beginning and end of a
line. The symbols ‘\<’ and ‘\>’ match the empty string at the
beginning and end of a word. The symbol ‘\b’ matches the empty
string at either edge of a word, and ‘\B’ matches the empty string
provided it is not at an edge of a word. (The interpretation of
‘word’ depends on the locale and implementation: these are all
extensions.)
From Perl-like Regular Expressions:
The escape sequences ‘\d’, ‘\s’ and ‘\w’ represent any decimal
digit, space character and ‘word’ character (letter, digit or
underscore in the current locale: in UTF-8 mode only ASCII letters
and digits are considered) respectively, and their upper-case
versions represent their negation. Vertical tab was not regarded
as a space character in a ‘C’ locale before PCRE 8.34 (included in
R 3.0.3). Sequences ‘\h’, ‘\v’, ‘\H’ and ‘\V’ match horizontal
and vertical space or the negation. (In UTF-8 mode, these do
match non-ASCII Unicode code points.)
Note that backslashes usually need to be doubled/protected in R input, e.g. you would use "\\h" to match horizontal space.
From ?Quotes:
Backslash is used to start an escape sequence inside character
constants. Escaping a character not in the following table is an
error.
\n newline
\r carriage return
\t tab
As others comment above, you may need a little more help if you're getting started with regular expressions for the first time. This is a little bit off-topic for StackOverflow (links to off-site resources), but there are some links to regular expression resources at the bottom of the gsubfn package overview. Or Google "regular expression tutorial" ...
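As a rough illustration of what the listed escapes do, here is a minimal sketch using Python's re module rather than R (an assumption made only for the sake of a runnable example; the sample strings are invented). The same patterns apply in stringr, except that in R string literals each backslash has to be doubled, as noted above.

```python
import re

text = "Room 42\tis open\nRoom 7 is closed"

# \d matches one decimal digit; + repeats it one or more times.
digits = re.findall(r"\d+", text)              # ['42', '7']

# \D is the negation of \d: runs of non-digit characters.
non_digits = re.findall(r"\D+", "a1b22c")      # ['a', 'b', 'c']

# \s matches whitespace (space, tab, newline); \S is its negation.
words = re.findall(r"\S+", "x \t y\nz")        # ['x', 'y', 'z']

# \w matches a "word" character (letter, digit, underscore); \W anything else.
tokens = re.split(r"\W+", "foo_bar, baz!qux")  # ['foo_bar', 'baz', 'qux']

# ^ and $ anchor at the start and end of each line when re.MULTILINE is set.
starts = re.findall(r"^Room", text, flags=re.MULTILINE)  # ['Room', 'Room']
ends = re.findall(r"\w+$", text, flags=re.MULTILINE)     # ['open', 'closed']

# | separates alternatives.
either = re.findall(r"open|closed", text)      # ['open', 'closed']
```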

\w doesn't work in vim search replace but a-zA-Z does? [duplicate]

tldr
[a-zA-Z\.-] works in Vim regex search replace, but [\w\.-] does not.
The text I'm searching:
1 string.here blah blah
24 another-string.here blah.
1523 another-string.goes.here. blah123
Desired output
string.here
another-string.here
another-string.goes.here
My Question
Why does this work:
:%s/\v^\d+\s+([a-zA-Z\.-]+)\s+.*/\1/g
But this does not:
:%s/\v^\d+\s+([\w\.-]+)\s+.*/\1/g
E486: Pattern not found :%s/\v^\d+\s+([\w\.-]+)\s+.*/\1/g
The only difference between the two is a-zA-Z vs \w inside square brackets. But doesn't \w equal a-zA-Z (plus some other non-whitespace characters not in this example text)?
I'm using default vim. Unmodified. Whatever comes with Ubuntu.
Non-vim platforms
When I try with the atom text editor instead of vim, both expressions work.
Search: ^\d+\s+([a-zA-Z\.-]+)\s+.*
Replace: $1
When I try with RegExr both expressions work. (Although I have to add the multiline tag)
Other things I've tried
My understanding is that \v is necessary for avoiding escaping hell. I've tried without it:
:%s/^\d\+\s\+\([a-zA-Z\.-]\+\)\s\+.*/\1/g
works
:%s/^\d\+\s\+\([\w\.-]\+\)\s\+.*/\1/g
does not work. ("Pattern not found")
I've also tried adding the m flag (so the end is /gm) but that didn't work
E488: Trailing characters
I've also tried without the ^.
:%s/\d\+\s\+\([\w\.-]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\([\w\.-]\+\)\s\+.*
I've also tried using \\w instead of \w.
:%s/\d\+\s\+\([\\w\.-]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\([\\w\.-]\+\)\s\+.*
I've also tried using \[ \] instead of [ ].
:%s/\d\+\s\+\(\[\\w\.-\]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\(\[\\w\.-\]\+\)\s\+.*
[a-zA-Z\.-] works in Vim regex search replace, but [\w\.-] does not.
[a-zA-Z\.-] is a collection of characters containing:
every character from a to z,
every character from A to Z,
the character .,
and the character -.
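:help /collection is regrettably not very explicit about this, but character classes like \w are not expanded inside a collection. The w is taken literally (and, per :help /[], so is the backslash in front of it), so [\w\.-] is effectively [\\w.-], which is not what you want:
the character \,
the character w,
the character .,
and the character -.
Inside a collection, use a bracket class such as [[:alnum:]_.-] instead.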
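For comparison, engines such as Python's re (used here purely as an illustration, not as a statement about Vim) do expand \w inside brackets, which is why the pattern worked in the other tools mentioned in the question. The sketch below runs the question's pattern on its sample text, and then mimics the Vim reading by spelling out the literal characters; the variable names are made up for the example.

```python
import re

lines = [
    "1 string.here blah blah",
    "24 another-string.here blah.",
    "1523 another-string.goes.here. blah123",
]

# In Python, \w inside [...] is still the word-character class.
pattern = re.compile(r"^\d+\s+([\w.-]+)\s+.*")
extracted = [pattern.sub(r"\1", line) for line in lines]
# ['string.here', 'another-string.here', 'another-string.goes.here.']
# Note the trailing dot on the last item: the class includes '.'.

# Mimic the literal reading: a class holding only \, w, . and -
literal_only = re.compile(r"^\d+\s+([\\w.-]+)\s+.*")
literal_hits = [literal_only.search(line) for line in lines]
# [None, None, None], i.e. "pattern not found" on every line
```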

How do I include a literal @ (at symbol) in Vim regex?

I'm trying to write a syntax rule for a Vim plugin I'm writing, and I'm having trouble writing a Vim regex that will match an @ symbol followed by an identifier, which is defined as two letters followed by any number of accepted characters. Here's what I have so far:
syntax match aldaAtMarker "\v@[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"
I know that everything after the @ works (at least, as far as I can tell) because I copy-pasted it from an aldaIdentifier rule that appears to work correctly. But I'm having trouble prepending the literal @ symbol because the Vim regex system evidently ascribes a special meaning to @ (see :help syntax and grep for @).
With my syntax rule as written above, trying to load the plugin results in the following errors:
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E866: (NFA regexp) Misplaced @
Press ENTER or type command to continue
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E64: @ follows nothing
Press ENTER or type command to continue
Error detected while processing /home/dave/.vim/bundle/vim-alda/syntax/alda.vim:
line 21:
E475: Invalid argument: aldaAtMarker "\v@[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"
Press ENTER or type command to continue
If I replace @ with \@, there are no errors, but the wrong things are highlighted, which makes me think that the \@ in my regex is being interpreted in a special way instead of being taken for a literal @ character.
I'm clearly missing something and my Google-fu is failing me. How do I include a literal @ symbol in a Vim regex in "very magic" (\v) mode?
From :help /magic:
The recommended mode is \m (magic), which is the default setting.
Otherwise, a literal @ can always be matched with the character set [@].
3. Magic */magic*
Some characters in the pattern are taken literally. They match with the same
character in the text. When preceded with a backslash however, these
characters get a special meaning.
Other characters have a special meaning without a backslash. They need to be
preceded with a backslash to match literally.
If a character is taken literally or not depends on the 'magic' option and the
items mentioned next.
*/\m* */\M*
Use of "\m" makes the pattern after it be interpreted as if 'magic' is set,
ignoring the actual value of the 'magic' option.
Use of "\M" makes the pattern after it be interpreted as if 'nomagic' is used.
*/\v* */\V*
Use of "\v" means that in the pattern after it all ASCII characters except
'0'-'9', 'a'-'z', 'A'-'Z' and '_' have a special meaning. "very magic"
Use of "\V" means that in the pattern after it only the backslash has a
special meaning. "very nomagic"
Examples:
after:    \v       \m       \M       \V       matches
                   'magic'  'nomagic'
          $        $        $        \$       matches end-of-line
          .        .        \.       \.       matches any character
          *        *        \*       \*       any number of the previous atom
          ()       \(\)     \(\)     \(\)     grouping into an atom
          |        \|       \|       \|       separating alternatives
          \a       \a       \a       \a       alphabetic character
          \\       \\       \\       \\       literal backslash
          \.       \.       .        .        literal dot
          \{       {        {        {        literal '{'
          a        a        a        a        literal 'a'
{only Vim supports \m, \M, \v and \V}
It is recommended to always keep the 'magic' option at the default setting,
which is 'magic'. This avoids portability problems. To make a pattern immune
to the 'magic' option being set or not, put "\m" or "\M" at the start of the
pattern.
It turns out that I had another syntax rule that was highlighting some additional things in the same color and throwing me off.
In very magic mode, \@ does appear to correctly escape the @ symbol:
syntax match aldaAtMarker "\v\@[a-zA-Z]{2,}[\w[:digit:]\-+'()]*"

What does the regex "/\\*{2,}/" mean? [duplicate]

I'm kinda new to regex, and specifically, I don't understand why there are 2 backslashes. I mean, I know the second one is to escape the character "*", but what does the first backslash do?
Well I'm passing this regex expression to the php function preg_match(), and I'm trying to find strings that include 2 or more consecutive "*".
That regex is invalid syntax.
You have this piece:
*{2,}
Which basically would read: match n-times, 2 or more times.
The following regex:
/\\*.{2,}/
Is the simplest and closest regex to the one you have, which would read as:
match 0 or more '\' and 2 or more characters that aren't newlines
If you are talking about the string itself, it may be interpreted in 2 ways:
/\\*{2,}/
Read as: match a literal \ zero or more times, and repeat that 2 or more times
This is invalid syntax, because {2,} is stacked directly on the * quantifier
/\*{2,}/
Read as: match 2 or more *
This is valid syntax
It all varies, depending on the escape character.
Edit:
Since the question was updated to show which language and engine are being used, I've added the following information:
You have to pass the regex as '/\*{2,}/' OR as "/\\*{2,}/" (watch the quotes).
Both are very similar, except that single quotes ('') only support the following escape sequences:
\' - Produces '
\\ - Produces \
Double-quoted strings are treated differently in PHP. They support many more escape sequences, like:
\" - Produces "
\\ - Produces \
\x<1-2 digit hex number> - Same as chr(0x<hex number>)
\0 - Produces a null char
\1 - Produces a control char (same as chr(1))
\u{<hex code point>} - Produces the UTF-8 encoding of that code point (PHP 7.0 and later)
\r - Produces a carriage return (the line ending on classic Mac OS)
\n - Produces a line feed (the line ending on Linux and macOS; Windows uses \r\n)
\t - Produces a tab
\<number> or \0<number> - Same as \x, but the numbers are in octal (e.g.: "\75" and "\075" produce =)
... (some more that I probably forgot) ...
\<anything else> - Left unchanged, backslash included (so "\'" stays \' and "\*" stays \*)
Read more about this on https://php.net/manual/en/language.types.string.php
Depending on the platform you're using, "/\\*{2,}/" may actually be a representation of a /\*{2,}/ string. This is because languages like Java treat \ as an escape character, so to actually put that character within a regex, you need to escape it in the regex string.
So, we have the /\*{2,}/ regex. \* matches the star character, and {2,} means at least two times. Your regex will match any two or more consecutive star characters.
Is it a string literal written in a program and if so which one? The double backslash may be to escape the escape char so that this regex matches at least 2 * star characters.
In JavaScript for example you need to escape the \ so that your string literal can express it as data before you transform it into a regular expression when using the RegExp constructor. Why do regex constructors need to be double escaped?
For PHP, that regex matches a literal * repeated 2 or more times.
But when you have to code it in PHP you have to escape the backslash (with a backslash) to use it in a string. For instance:
$re = "/\\*{2,}/";
$str = "...";
preg_match($re, $str, $matches);
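The same two-layer escaping can be checked in Python, used here only as an illustration since the question itself is about PHP. The sketch assumes nothing beyond the standard re module; the sample strings are invented. It shows that the doubled backslash in an ordinary string literal and the single backslash in a raw string yield the same regex.

```python
import re

escaped = "\\*{2,}"  # ordinary literal: \\ collapses to a single backslash
raw = r"\*{2,}"      # raw literal: the backslash is kept exactly as typed

same = escaped == raw  # True: both hold the 6 characters \*{2,}

# The regex engine then reads \* as a literal star and {2,} as "2 or more".
runs = re.findall(raw, "a ** b *** c * d")   # ['**', '***']
single = re.search(raw, "a * b")             # None: one star is not enough
```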

RegEx with Pipes and IPs not working

The RegEx:
^([0-9\.]+)\Q|\E([^\Q|\E])\Q|\E
does not match the string:
1203730263.912|12.66.18.0|
Why?
From PHP docs,
\Q and \E can be used to ignore regexp metacharacters in the pattern.
For example:
\w+\Q.$.\E$ will match one or more word characters, followed by literals .$. and anchored at the end of the string.
And your regex should be,
^([0-9\.]+)\Q|\E([^\Q|\E]*)\Q|\E
OR
^([0-9\.]+)\Q|\E([^\Q|\E]+)\Q|\E
You forgot to add + after [^\Q|\E]. Without +, it matches a single character.
Explanation:
^ Starting point.
([0-9\.]+) Captures digits or dot one or more times.
\Q|\E In PCRE, \Q and \E delimit a quoted sequence: every character between them is treated literally. So the | symbol in that block tells the regex engine to match a literal |.
([^\Q|\E]+) Captures any character other than | one or more times.
\Q|\E Matches a literal pipe symbol.
The accepted answer seems somewhat incorrect so I wanted to address this for future readers.
If you did not already know, using \Q and \E ensures that any character between \Q ... \E will be matched literally, not interpreted as a metacharacter by the regular expression engine.
First and most important, \Q and \E is NOT usable within a bracketed character class [].
[^\Q|\E] # Incorrect
[^|] # Correct
Secondly, you did not follow that class with a quantifier. With both fixes applied, the correct syntax would be:
^([0-9.]+)\Q|\E([^|]+)\Q|\E
Although, it is much simpler to write this out as:
^([0-9.]+)\|([^|]+)\|
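The simplified pattern can be tried against the string from the question. This is a minimal sketch in Python, chosen only for illustration; Python's re has no \Q...\E, so re.escape is used as the nearest equivalent, and the variable names are invented.

```python
import re

line = "1203730263.912|12.66.18.0|"

# Escape | outside a character class; inside [...] it is already literal.
pattern = re.compile(r"^([0-9.]+)\|([^|]+)\|")
timestamp, ip = pattern.match(line).groups()
# ('1203730263.912', '12.66.18.0')

# Without the quantifier the middle group consumes one character only,
# so the following \| cannot match and the whole pattern fails.
no_quantifier = re.compile(r"^([0-9.]+)\|([^|])\|")
failed = no_quantifier.match(line)   # None

# re.escape plays the role of \Q...\E when building a pattern from pieces.
built = "^([0-9.]+)" + re.escape("|") + "([^|]+)" + re.escape("|")
same_pattern = built == pattern.pattern   # True
```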

How are \\n and \\\n interpreted by an extended regular expression?

Within an ERE, a backslash character (\, \a, \b, \f, \n,
\r, \t, \v) is considered to begin an escape sequence.
When I see \\n and [\\\n], I can guess that both mean \ followed by a newline, but I'm confused about exactly how such a sequence is interpreted: how many \s are required at all?
UPDATE
I don't have problem understanding regex in programing languages so please make the context within the lexer.
[root@ ]# echo "test\
> hi"
This is dependent on the programming language and on its string handling options.
For example, in Java strings, if you need a literal backslash in a string, you need to double it. So the regex \n must be written as "\\n". If you plan to match a backslash using a regex, then you need to escape it twice - once for Java's string handler, and once for the regex engine. So, to match \, the regex is \\, and the corresponding Java string is "\\\\".
Many programming languages have special "verbatim" or "raw" strings where you don't need to escape backslashes. So the regex \n can be written as a normal Python string as "\\n" or as a Python raw string as r"\n". The Python string "\n" is the actual newline character.
This can become confusing, because sometimes not escaping the backslash happens to work. For example, the Python string "\d\n" happens to work as a regex that's intended to match a digit followed by a newline. This is because \d isn't a recognized character escape sequence in Python strings, so it's kept as a literal \d and fed that way to the regex engine. The \n is translated to an actual newline, but that happens to match the newline in the string that the regex is tested against.
However, if you forget to escape a backslash where the resulting sequence is a valid character escape sequence, bad things happen. For example, the regex \bfoo\b matches an entire word foo (but it doesn't match the foo in foobar). If you write the regex string as "\bfoo\b", the \bs are translated into backspace characters by the string processor, so the regex engine is told to match <backspace>foo<backspace> which obviously will fail.
Solution: Always use verbatim strings where you have them (e.g. Python's r"...", .NET's @"...") or use regex literals where you have them (e.g. JavaScript's and Ruby's /.../). Or use RegexBuddy to automatically translate the regex for you into your language's special format.
To get back to your examples:
\\n as a regex means "Match a backslash, followed by n"
[\\\n] as a regex means "Match either a backslash or a newline character".
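These readings can be checked with a short sketch in Python, using raw strings so that the regex engine receives exactly the characters shown. The test strings and variable names are invented for the example; the last two lines reproduce the \bfoo\b pitfall described earlier.

```python
import re

backslash_n = "\\" + "n"   # two characters: a backslash, then the letter n
newline = "\n"             # one character: a line feed

# Regex \\n: an escaped backslash followed by a literal n.
matches_pair = re.fullmatch(r"\\n", backslash_n) is not None    # True
matches_newline = re.fullmatch(r"\\n", newline) is not None     # False

# Regex [\\\n]: a class containing a backslash and a newline.
class_hits = re.findall(r"[\\\n]", "a\\b\nc")   # ['\\', '\n']

# Pitfall: without r"...", \b becomes a backspace before the regex sees it.
raw_hit = re.search(r"\bfoo\b", "a foo b") is not None    # True
cooked_hit = re.search("\bfoo\b", "a foo b") is not None  # False
```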
Actually, a regex specified as a string literal is processed by two compilers: the programming language's compiler and the regex compiler:
Original    Compiled       Regex compiled
"\n"        NL             NL
"\\n"       '\'+'n'        NL
"\\\n"      '\'+NL         NL
"\\\\n"     '\'+'\'+'n'    '\'+'n'
So you must use the shortest format "\n".
Code examples:
JavaScript:
'a\nb'.replace(RegExp("\n"),'<br>')
'a\nb'.replace(RegExp("\\n"),'<br>')
'a\nb'.replace(RegExp("\\\n"),'<br>')
but not:
'a\nb'.replace(/\\\n/,'<br>')
Java:
System.out.println("a\nb".replaceAll("\n","<br>"));
System.out.println("a\nb".replaceAll("\\n","<br>"));
System.out.println("a\nb".replaceAll("\\\n","<br>"));
Python:
import re  # the standard-library regex module
str.join('<br>', re.split('\n', 'a\nb'))
str.join('<br>', re.split('\\n', 'a\nb'))
str.join('<br>', re.split('\\\n', 'a\nb'))