I'm using vim to do a search and replace with this command:
%s/lambda\s*{\([\n\s\S]\)*//gc
I'm trying to match for all word, endline and whitespace characters after a {. For instance, the entirety of this line should match:
lambda {
FactoryGirl.create ...
Instead, it only matches up to the newline and no spaces before FactoryGirl. I've tried manually replacing all the spaces before, just in case there were tab characters instead, but no dice. Can anyone explain why this doesn't work?
The \s is an atom for whitespace; \n, though it looks similar, is syntactically an escape sequence for a newline character. Inside the collection atom [...], you cannot include other atoms, only characters (including some special ones like \n). From :help /[]:
The following translations are accepted when the 'l' flag is not
included in 'cpoptions' {not in Vi}:
\e <Esc>
\t <Tab>
\r <CR> (NOT end-of-line!)
\b <BS>
\n line break, see above |/[\n]|
\d123 decimal number of character
\o40 octal number of character up to 0377
\x20 hexadecimal number of character up to 0xff
\u20AC hex. number of multibyte character up to 0xffff
\U1234 hex. number of multibyte character up to 0xffffffff
NOTE: The other backslash codes mentioned above do not work inside
[]!
So, either specify the whitespace characters literally [ \t\n...], use the corresponding character class expression [[:space:]...], or combine the atom with the collection via logical or \%(\s\|[...]\).
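As a rough illustration of the first option (listing the whitespace characters literally inside the collection), here is a sketch in Python rather than Vim. Python's re module follows different rules (it does accept \s inside [...]), so only the literal-character form is used here. The sample text, the letters-and-dot part of the class, and the name strip_lambda_body are assumptions made up for the example.

```python
import re

# Class built only from literal characters and escapes such as \t and \n,
# mirroring the Vim-safe form [ \t\n...]. The letters and '.' are an
# assumption about what the lambda body contains.
PATTERN = re.compile(r"lambda[ \t]*\{[ \t\nA-Za-z.]*")

def strip_lambda_body(text):
    """Remove 'lambda {' plus the following run of class characters."""
    return PATTERN.sub("", text)

sample = "x = lambda {\n  FactoryGirl.create\n}"
print(repr(strip_lambda_body(sample)))
```

Because \n is a member of the class, the match runs across the line break and stops only at the first character that is not listed (here the closing brace).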
Vim interprets characters inside of the [ ... ] character classes differently. It's not literally, since that regex wouldn't fully match lambda {sss or lambda {\\\. What \s and \S are interpreted as...I still can't explain.
However, I was able to achieve nearly what I wanted with:
%s/lambda\s*{\([\n a-zA-Z]\)*//gc
That ignores punctuation, which I wanted. This works, but is dangerous:
%s/lambda\s*{\([\n a-zA-Z]\|.\)*//gc
Because adding on a character after the last character like } causes vim to hang while globbing. So my solution was to add the punctuation I needed into the character class.
Related
From the question "How to replace a character for a newline in Vim?": you have to use \r when replacing text with a newline, like this:
:%s/%/\r/g
But when replacing end-of-lines and newlines with a character, you can do it like this:
:%s/\n/%/g
What section of the manual documents these behaviors, and what's the reasoning behind them?
From http://vim.wikia.com/wiki/Search_and_replace :
When Searching
...
\n is newline, \r is CR (carriage return = Ctrl-M = ^M)
When Replacing
...
\r is newline, \n is a null byte (0x00).
From vim docs on patterns:
\r matches <CR>
\n matches an end-of-line -
When matching in a string instead of
buffer text a literal newline
character is matched.
Another aspect to this is that \0, which is traditionally NULL, is taken in
s//\0/ to mean "the whole matched pattern". (Which, by the way, is redundant with, and longer than, &).
So you can't use \0 to mean NULL, so you use \n
So you can't use \n to mean \n, so you use \r.
So you can't use \r to mean \r, but I don't know who would want to add that char on purpose.
:help NL-used-for-Nul
Technical detail:
<Nul> characters in the file are stored as <NL> in memory. In the display
they are shown as "^@". The translation is done when reading and writing
files. To match a <Nul> with a search pattern you can just enter CTRL-@ or
"CTRL-V 000". This is probably just what you expect. Internally the
character is replaced with a <NL> in the search pattern. What is unusual is
that typing CTRL-V CTRL-J also inserts a <NL>, thus also searches for a <Nul>
in the file. {Vi cannot handle <Nul> characters in the file at all}
First of all, open :h :s to see the section "4.2 Substitute" of documentation on "Change". Here's what the command accepts:
:[range]s[ubstitute]/{pattern}/{string}/[flags] [count]
Notice the description about pattern and string
For the {pattern} see |pattern|.
{string} can be a literal string, or something
special; see |sub-replace-special|.
So now you know that the search pattern and replacement patterns follow different rules.
If you follow the link to |pattern|, it takes you to the section that explains the whole regexp patterns used in Vim.
Meanwhile, |sub-replace-special| takes you to the subsection of "4.2 Substitute", which contains the patterns for substitution, among which is \r for line break/split.
(The shortcut to this part of manual is :h :s%)
/\/\*[ \t]*\./ /import/i /[ \t\w\/\.\=\-;\[\]\$>"']+\*\/[ \t]*[\n\r]{1,2}/
In the above regular expression, I don't know the meaning of [ \t\w\/\.\=\-;\[\]\$>"']+
or what kind of data it is going to handle.
Can anyone please explain it to me with some example data?
Your characters are inside a Character Class, which means..
[ \t\w\/\.\=\-;\[\]\$>"']+
Any character of:
' ', \t (tab)
word characters (a-z, A-Z, 0-9, _)
\/, \., \=, \-, ;, \[, \], \$, >, ", '
(1 or more times)
In a regular expression characters that are to be interpreted literally rather than as metacharacters can be escaped by preceding them with a backslash symbol (\). Therefore, If you want to use any of these characters as a literal in a regular expression, you need to escape them with a backslash.
For PCRE, and most other Perl-compatible flavors, escape these inside of character classes:
^]\-
And escape these outside of character classes:
^.*+?$|()[{\
Note: The hyphen does not need to be escaped if it is the first or last character inside the character class, since it cannot form a range there.
So basically, this could be simplified to the following.
[ \t\w\/.=;[\]$>"'-]+
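To make that concrete with some sample data, here is a small sketch in Python. It assumes Python's re module treats this particular class the same way as the Perl-compatible flavors discussed above; the sample strings and the helper name runs are invented for the illustration.

```python
import re

# The class as written in the question, and the simplified form.
VERBOSE = re.compile(r"""[ \t\w\/\.\=\-;\[\]\$>"']+""")
SIMPLE = re.compile(r"""[ \t\w\/.=;[\]$>"'-]+""")

def runs(pattern, text):
    """Return every maximal run of characters that belong to the class."""
    return pattern.findall(text)

samples = ['src/app.js', 'x = y[0];', '$var > "a-b"', 'a*b(c)']
for s in samples:
    # Both forms should report the same runs for each sample.
    print(repr(s), runs(VERBOSE, s), runs(SIMPLE, s))
```

The first three samples consist only of characters from the class, so each is matched as one run. The last one is split at '*', '(' and ')', which are not in the class.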
To escape a character means to not use its common role, but its special role (if it has one).
For example, the common role for the letter "w" is a simple character "w" inside or outside a character class.
If the character "w" is escaped by putting a \ character before it, \w will have a special role and means any "word" character (letters, digits and _ character) inside or outside a character class.
The common role for the character "]" is not the simple character "]", but it has the role of ending a character class.
If the character "]" is escaped by placing a \ before it, \] will have a special role and it will mean this time a simple "]" character inside or outside a character class.
Outside a character class some characters like "$", "*", "?", "+" have another role than their simple character values, so when you want to specify a plus sign symbol for example, you need to escape it using \+ because otherwise its common role will be to mean "the previous character one or more times".
Inside a character class however, some of the characters are always used as common characters, so they don't need to be escaped. So for example you don't need to use \= \* \+ \? inside a character class, but only = * + ?.
Inside a character class you need to escape however some characters like "]" because otherwise it will mean the end of the character class.
You also need to escape the character "-" because otherwise it will not be treated as a simple dash, but it will create a range between previous and next characters.
The alternative is to always place the "-" character as the first or the last character in the character class, and in that case it doesn't need to be escaped.
It may look to be complicated, but actually it is not.
You need to think logically. What happens if you don't escape the "+" character when it appears in a character class? Can it mean that the previous character may appear one or more times? Such a thing wouldn't make any sense in a character class, so you don't need to escape it. The "=" character doesn't have any special role inside or outside a character class, so you don't need to escape it either.
The simple dot "." outside a character class means any character (but not \n unless the /s modifier is used), but in a character class its common meaning is to act as a simple dot (.) so you don't need to escape it either.
These are not all details regarding the common and special meanings of all characters but I gave them only as examples to show what escaping means.
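A few of these rules can be checked quickly. The sketch below uses Python's re module, on the assumption that it behaves like the Perl-style flavors described here for these particular cases; the variable names are just labels for the example.

```python
import re

# Outside a class, '+' is a quantifier unless it is escaped.
quantifier = re.fullmatch(r"a+b", "aaab") is not None    # one or more 'a', then 'b'
literal_plus = re.fullmatch(r"a\+b", "a+b") is not None  # escaped: a plain '+'

# Inside a class, '+' is already literal, so no backslash is needed.
class_plus = re.fullmatch(r"a[+]b", "a+b") is not None

# '-' between two characters forms a range; placed last it is a plain dash.
in_range = re.findall(r"[a-c]", "abc-")
with_dash = re.findall(r"[ac-]", "abc-")

# ']' has to be escaped inside a class to be taken literally.
brackets = re.findall(r"[\]]", "x]y")

# '.' is a wildcard outside a class and a plain dot inside one.
dots = re.findall(r"[.]", "a.b")

print(quantifier, literal_plus, class_plus)
print(in_range, with_dash, brackets, dots)
```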
I have the following text:
üyü
The following regex search matches the characters ü:
/\W
Is there a unicode flag in Vim regex?
Unfortunately, there is no such flag (yet).
Some built-in character classes (can) include multi-byte characters,
others don't. The common \w \a \l \u classes only contain ASCII
letters, so even umlaut characters aren't included in them, leading to
unexpected behavior! See also https://unix.stackexchange.com/a/60600/18876.
In the 'isprint' option (and 'iskeyword', which determines what motions like w move over), multi-byte characters 256 and
above are always included, only extended ASCII characters up to 255 are specified with
this option.
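For comparison only, the same ASCII-versus-Unicode distinction can be reproduced in Python, whose re module does have a switch for it. This is a sketch of the general behavior, not of anything Vim offers.

```python
import re

text = "\u00fcy\u00fc"  # the sample text "üyü"

# With re.ASCII, \w covers only [a-zA-Z0-9_], so 'ü' counts as a non-word
# character, which is the behavior described for Vim's \w above.
ascii_nonword = re.findall(r"\W", text, flags=re.ASCII)

# By default, str patterns use Unicode-aware classes, so 'ü' is a word character.
unicode_nonword = re.findall(r"\W", text)

print(ascii_nonword, unicode_nonword)
```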
I always use:
ASCII   UTF-8
-----   -----
\w      [a-zA-Z\u0100-\uFFFF]
\W      [^a-zA-Z\u0100-\uFFFF]
You can use \%uXXXX to match a multibyte character. In that case…
/\%u00fc
But I'm not aware of a flag that would make the whole matching multibyte-friendly.
Note that with the default value of iskeyword on UNIX systems, ü is matched by \k.
very often I find \S+ takes me where I want to go. i.e:
s/\(\S\+\)\s\+\(\S\+\).*/\1 | \2/ selects "wörd1 w€rd2 but not word3" and replaces the line with "wörd1 | w€rd2"
I can only find negative lookbehind for this, something like (?<!\\).
But this won't compile in C++ and flex. It seems that neither regex.h nor flex supports this?
I am trying to implement a shell which has to treat special characters like >, < or | as a normal argument string if preceded by a backslash. In other words, only treat a special character as special if it is preceded by zero or an even number of '\'.
So echo \\>a or echo abc>a should direct output to a
but echo \>a should print >a
What regular expression should I use?
I'm using flex and yacc to parse the input.
In a Flex rule file, you'd use \\ to match a single backslash '\' character. This is because the \ is used as an escape character in Flex.
BACKSLASH \\
LITERAL_BACKSLASH \\\\
LITERAL_LESSTHAN \\<
LITERAL_GREATERTHAN \\>
LITERAL_VERTICALBAR \\\|
If I follow you correctly, in your case you want "\>" to be treated as literal '>' but "\\>" to be treated as literal '\' followed by special redirect. You don't need negative look behind or anything particularly special to accomplish this as you can build one rule that would accept both your regular argument characters and also the literal versions of your special characters.
For purposes of discussion, let's assume that your argument/parameter can contain any character but ' ', '\t', and the special forms of '>', '<', '|'. The rule for the argument would then be something like:
ARGUMENT ([^ \t\\><|]|\\\\|\\>|\\<|\\\|)+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\\\ matches any instance of "\\" (i.e. a literal backslash)
\\> matches any instance of "\>" (i.e. a literal greater than)
\\< matches any instance of "\<" (i.e. a literal less than)
\\\| matches any instance of "\|" (i.e. a literal vertical bar/pipe)
Actually... You can probably just shorten that rule to:
ARGUMENT ([^ \t\\><|]|\\[^ \t\r\n])+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\[^ \t\r\n] matches any character preceded by a '\' in your input except for whitespace (which will handle all of your special characters and allow for literal forms of all other characters)
If you want to allow for literal whitespace in your arguments/parameters then you could shorten the rule even further but be careful with using \\. for the second half of the rule alternation as it may or may not match a "\" followed by "\n" (i.e. eat your trailing command terminator character!).
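To see how the shortened rule behaves on the inputs from the question, here is a sketch that transcribes it into Python's re syntax. This is only an approximation of what a Flex scanner would do: the op group for unescaped special characters, the TOKEN name and the tokenize helper are assumptions added for the demonstration.

```python
import re

TOKEN = re.compile(
    r"(?P<arg>(?:[^ \t\\><|]|\\[^ \t\r\n])+)"  # the shortened ARGUMENT rule
    r"|(?P<op>[><|])"                          # an unescaped special character
)

def tokenize(line):
    """Return (kind, text) pairs, scanning the line left to right."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(line)]

for cmd in (r"echo \>a", r"echo \\>a", "echo abc>a"):
    print(cmd, "->", tokenize(cmd))
```

Since each backslash consumes the character after it, a run of backslashes is eaten in pairs, and a following > is special exactly when the number of backslashes before it is zero or even. No lookbehind is involved.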
Hope that helps!
You cannot easily extract single escaped characters from a command-line, since you will not know the context of the character. In the simplest case, consider the following:
LessThan:\<
BackslashFrom:\\<
In the first one, < is an escaped character; in the second one, it is not. If your language includes quotes (as most shells do), things become even more complicated. It's a lot better to parse the string left to right, one entity at a time. (I'd use flex myself, because I've stopped wasting my time writing and testing lexers, but you might have some pedagogical reason to do so.)
If you really need to find a special character which shouldn't be special, just search for it (in C++98, where you don't have raw literals, you'll have to escape all of the backslashes):
regex: (\\\\)*\\[<>|]
(An even number -- possibly 0 -- of \, then a \ and a <, > or |)
as a C string => "(\\\\\\\\)*\\\\[<>|]"
I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!
This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
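These three roles of the caret can be checked with a short sketch. It is written in Python rather than Perl, on the assumption that the two flavors agree on these basic rules; the test strings are invented.

```python
import re

# '(' and ')' and the non-leading '^' are ordinary members of this class.
members = re.findall(r"[(^\n^\r)\s]", "a(^)b \n")

# A leading '^' negates the class: everything except 'a' and 'b'.
negated = re.findall(r"[^ab]", "a(^)b")

# Outside a class, '^' anchors the match at the start of the string.
anchored = re.findall(r"^a", "aXa")

print(members, negated, anchored)
```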
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (That only works in Perl, AFAIK. In Ruby/Oniguruma, \h matches a hex digit; in every other flavor I know of, it's a syntax error.)
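For readers without Perl at hand, roughly the same substitution can be sketched in Python. Python's re has no \h, so [ \t] stands in for it here, which is an approximation since Perl's \h covers more than spaces and tabs. The function name squeeze is made up for the example.

```python
import re

def squeeze(line):
    # ([.,])\1+ : a period or comma followed by one or more copies of itself
    # [ \t]+    : runs of spaces and tabs, standing in for Perl's \h+
    return re.sub(r"([.,])\1+|[ \t]+", "", line)

print(squeeze("55.25040882, 3,,,,,,"))
print(repr(squeeze("a.,b\n")))
```

The second call shows two details: a mixed pair such as ., is left alone because the backreference only matches a repeat of the same character, and the trailing newline is kept.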
You can try using: -
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # Negates non-whitespace char, \n and \r
print $line
OUTPUT: -
55.25040882,3
[^\S\n\r]|[.,]{2,} -> This means either [^\S\n\r] or [.,]{2,}
[.,]{2,} -> This matches a run of two or more , or . characters in a row, so the whole run is removed.
[^\S\n\r] -> This matches any whitespace character except linefeed (\n) and carriage return (\r).
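The same pattern can be tried in Python, whose re module accepts this class syntax as well. This is a sketch for experimenting, with clean as a made-up helper name.

```python
import re

def clean(line):
    # [^\S\n\r] : any whitespace character other than \n and \r
    # [.,]{2,}  : a run of two or more periods and/or commas
    return re.sub(r"[^\S\n\r]|[.,]{2,}", "", line)

print(clean("55.25040...882, 3,,,,,,"))
print(repr(clean("a.,b \n")))
```

Note that [.,]{2,} also removes mixed runs such as ., which the backreference form ([.,])\1+ would leave in place.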