What's the difference between [:space:] and [:blank:]? - regex

From A Brief Introduction to Regular Expressions:
[:blank:] matches a space or a tab.
[:space:] matches whitespace characters (space and horizontal tab).
To me both definitions are the same and I was wondering if they are really duplicates?
If they are different, what are the differences?

For the GNU tools the following from grep.info applies:
[:blank:]
Blank characters: space and tab.
[:space:]
Space characters: in the 'C' locale, this is tab, newline,
vertical tab, form feed, carriage return, and space.
You can find the section with this command:
info grep 'Regular Expressions' 'Character Classes and Bracket Expressions'
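To make the grep.info definitions concrete, here is a minimal Python sketch. Python's re module has no POSIX bracket classes, so the two classes are spelled out by hand for the 'C' locale; the names BLANK, SPACE, and classify are illustrative only and do not come from grep.

```python
import re

# Hand-written equivalents of the POSIX classes in the 'C' locale
# (assumption: Python's re has no [[:blank:]] / [[:space:]] syntax).
BLANK = re.compile(r"[ \t]")           # [:blank:] -> space, tab
SPACE = re.compile(r"[ \t\n\v\f\r]")   # [:space:] -> adds newline, VT, FF, CR


def classify(ch):
    """Report which of the two classes match the single character ch."""
    return {
        "blank": BLANK.fullmatch(ch) is not None,
        "space": SPACE.fullmatch(ch) is not None,
    }


for ch in [" ", "\t", "\n", "\r", "\f", "\v"]:
    print(repr(ch), classify(ch))
```

Every character in the blank class is also in the space class, but not the other way round.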

A better explanation of what they each match is available here
http://www.regular-expressions.info/posixbrackets.html
The biggest difference is that [:space:] also matches characters such as newlines.

[:blank:] covers only what the space bar and the Tab key produce: space and tab.
[:space:] covers all whitespace: space and tab, plus newline, vertical tab, form feed, and carriage return.

Related

Vim regex to replace zero or one spaces at the end of a line with two spaces [duplicate]

From question How to replace a character for a newline in Vim?. You have to use \r when replacing text for a newline, like this
:%s/%/\r/g
But when replacing end of lines and newlines for a character, you can do it like:
:%s/\n/%/g
What section of the manual documents these behaviors, and what's the reasoning behind them?
From http://vim.wikia.com/wiki/Search_and_replace :
When Searching
...
\n is newline, \r is CR (carriage return = Ctrl-M = ^M)
When Replacing
...
\r is newline, \n is a null byte (0x00).
From vim docs on patterns:
\r matches <CR>
\n matches an end-of-line.
When matching in a string instead of
buffer text, a literal newline
character is matched.
Another aspect to this is that \0, which is traditionally NULL, is taken in
s//\0/ to mean "the whole matched pattern". (Which, by the way, is redundant with, and longer than, &).
So you can't use \0 to mean NULL, so you use \n
So you can't use \n to mean \n, so you use \r.
So you can't use \r to mean \r, but I don't know who would want to add that char on purpose.
—☈
:help NL-used-for-Nul
Technical detail:
<Nul> characters in the file are stored as <NL> in memory. In the display
they are shown as "^@". The translation is done when reading and writing
files. To match a <Nul> with a search pattern you can just enter CTRL-@ or
"CTRL-V 000". This is probably just what you expect. Internally the
character is replaced with a <NL> in the search pattern. What is unusual is
that typing CTRL-V CTRL-J also inserts a <NL>, thus also searches for a <Nul>
in the file. {Vi cannot handle <Nul> characters in the file at all}
First of all, open :h :s to see the section "4.2 Substitute" of documentation on "Change". Here's what the command accepts:
:[range]s[ubstitute]/{pattern}/{string}/[flags] [count]
Notice the description about pattern and string
For the {pattern} see |pattern|.
{string} can be a literal string, or something
special; see |sub-replace-special|.
So now you know that the search pattern and replacement patterns follow different rules.
If you follow the link to |pattern|, it takes you to the section that explains the whole regexp patterns used in Vim.
Meanwhile, |sub-replace-special| takes you to the subsection of "4.2 Substitute", which contains the patterns for substitution, among which is \r for line break/split.
(The shortcut to this part of manual is :h :s%)

vim: why whitespace character \s not working in bracket expression, but \n do

for example, put /\s in vim will match any whitespace, but /[\s] does not.
I thought it's because backslash escapes are not allowed in bracket expression as stated in https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended, but then I tested /[\n] which correctly match the end-of-line.
Why does it behave this way? Is this behavior documented anywhere?
Vim uses its own brand of regular expression syntax which is AFAIK not documented on Wikipedia. Like everything Vim, it's documented in Vim itself, :help pattern, and there is a good introduction online too.
Collections contain single characters. Since \s is itself a collection of whitespace characters and not a single character it can't be contained in another collection. If you want to include whitespace characters in a collection, you'll have to include them one-by-one: [ \t].
[\n] works because \n is a single character.
:help pattern
:help /[]
:help /[\n]

Regular expression matching space but at the end of line

I'm trying to replace multiple spaces with a single one, but not at the start or end of the line.
Example:
___abc___def__
___ghi___jkl__
should turn to
___abc_def__
___ghi_jkl__
Note that I've replaced space with underscore
A simple search using the following pattern:
([^\s])\s+
matches the space at the end of the first line up to the space at the beginning of the next one.
So, if I replace with \1_, I get the following:
___abc_def_ghi_jkl
And that is absolutely not what I expect and regex engines, e.g., PowerGREP or the one in Visual Studio, don't behave that way.
If you want to match only horizontal spaces, use \h:
Find what: (?<=\S)\h+(?=\S)
Replace with: (a space)
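As a rough illustration of the lookaround approach, here is a Python sketch. Python's re module does not support \h, so [ \t] is used in its place (an assumption that ignores other horizontal whitespace); the function name collapse_inner_spaces is made up for this example.

```python
import re

# Assumption: [ \t] stands in for \h, which Python's re does not support.
INNER_GAP = re.compile(r"(?<=\S)[ \t]+(?=\S)")


def collapse_inner_spaces(text):
    """Collapse runs of spaces/tabs that sit between two non-space characters."""
    return INNER_GAP.sub(" ", text)


sample = "   abc   def  \n   ghi   jkl  "
print(repr(collapse_inner_spaces(sample)))

# For contrast, the pattern from the question: \s+ also consumes the newline,
# so the two lines are joined.
print(repr(re.sub(r"([^\s])\s+", r"\1_", sample)))
```

The lookbehind and lookahead both require a non-whitespace neighbour, which is why leading and trailing runs are left alone.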
There are several possible interpretations of the question. For each of them the replacement will be a single space character.
If "spaces" is plural and means space characters but not tabs, use
a find string of (^ {2,})|( {2,}$).
If "spaces" is plural and should include tabs, use a find string
of (^[ \t]{2,})|([ \t]{2,}$).
If any leading or trailing spaces and tabs (one or more) are to be
replaced with a space, use a find string of (^[ \t]+)|([ \t]+$).
The general form of each of these is (^...)|(...$). The | means an alternation so either the preceding or the following bracketed expression can match. Hence the find what text can match either at the beginning or the end of a line. The ... varies depending on exactly what needs to be matched. Specifying [ \t] means only the two characters space and tab, whereas \s includes the line-end characters.
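The (^...)|(...$) form can be tried out with a short Python sketch. It uses the third variant (one or more spaces or tabs) and assumes multi-line input, hence re.MULTILINE; the name squeeze_edges is hypothetical.

```python
import re

# One or more spaces/tabs at either edge of a line.
# re.MULTILINE makes ^ and $ match at every line boundary.
EDGE_WS = re.compile(r"(^[ \t]+)|([ \t]+$)", re.MULTILINE)


def squeeze_edges(text):
    """Replace each leading or trailing run of spaces/tabs with one space."""
    return EDGE_WS.sub(" ", text)


print(repr(squeeze_edges("  abc \t\ndef\t\t")))
```

Because the class is [ \t] rather than \s, the newline between the two lines is never part of a match.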
Ok, so the intention was to replace this:
Hey diddle diddle, \n
The Cat and the fiddle,\n
with this:
Hey diddle diddle,\n
The Cat and the fiddle,\n
A slightly modified version of Toto's answer did the trick:
(?<=\S)\h+(?=\S)|\s+$
finding any space(s) between word-characters and trailing space at the end of the line.

Removing repeated characters, including spaces, in one line

I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!
This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (Support varies by flavor: Perl 5.10+, PCRE, and Java 8+ have it, but in Ruby/Oniguruma \h matches a hex digit, and many other flavors do not recognize it at all.)
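For anyone who wants to experiment outside Perl, the same idea can be sketched in Python. Python's re has no \h, so [ \t] is substituted here (an assumption that covers only ASCII space and tab); the variable names are illustrative.

```python
import re

line = "55.25040882, 3,,,,,,"

# A repeated period or comma (backreference \1), or a run of spaces/tabs.
# Assumption: [ \t] stands in for Perl's \h, which Python's re lacks.
CLEANUP = re.compile(r"([.,])\1+|[ \t]+")
print(CLEANUP.sub("", line))

# The single-class attempt from the question, applied once (like s/// without /g):
# the first run of two or more class members is ", ", so that is what gets removed.
print(re.sub(r"[.,\s]{2,}", "", line, count=1))
```

The backreference is what separates "the same character repeated" from "any two characters out of the class".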
You can try this:
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # whitespace except \n and \r, or runs of 2+ periods/commas
print $line;
Output:
55.25040882,3
[^\S\n\r]|[.,]{2,} -> Matches either [^\S\n\r] or [.,]{2,}.
[.,]{2,} -> Matches a run of two or more commas or periods (in any mix), which is then removed.
[^\S\n\r] -> Matches any character that is not a non-whitespace character, a linefeed, or a carriage return; that is, any whitespace character other than \n and \r.

Regular Expression WhiteSpace Character that Excludes Newline [duplicate]

I sometimes want to match whitespace but not newline.
So far I've been resorting to [ \t]. Is there a less awkward way?
Use a double-negative:
/[^\S\r\n]/
That is, not-not-whitespace (the capital S complements) or not-carriage-return or not-newline. Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law, this is equivalent to “whitespace but not carriage return or newline.” Including both \r and \n in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and DOS-ish (CR LF) newline conventions.
No need to take my word for it:
#! /usr/bin/env perl
use strict;
use warnings;
use 5.005; # for qr//
my $ws_not_crlf = qr/[^\S\r\n]/;
for (' ', '\f', '\t', '\r', '\n') {
    my $qq = qq["$_"];
    printf "%-4s => %s\n", $qq,
        (eval $qq) =~ $ws_not_crlf ? "match" : "no match";
}
Output:
" " => match
"\f" => match
"\t" => match
"\r" => no match
"\n" => no match
Note the exclusion of vertical tab, but this is addressed in v5.18.
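The same double-negative class also works in Python, as this small sketch suggests. One difference to keep in mind: for str patterns, Python 3's \s already includes vertical tab, so "\v" matches here.

```python
import re

# Not (non-whitespace, CR, or LF), i.e. whitespace other than CR and LF.
WS_NOT_CRLF = re.compile(r"[^\S\r\n]")

for ch in [" ", "\f", "\t", "\v", "\r", "\n"]:
    verdict = "match" if WS_NOT_CRLF.fullmatch(ch) else "no match"
    print(f"{ch!r:6} => {verdict}")
```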
Before objecting too harshly, the Perl documentation uses the same technique. A footnote in the “Whitespace” section of perlrecharclass reads
Prior to Perl v5.18, \s did not match the vertical tab. [^\S\cK] (obscurely) matches what \s traditionally did.
The same section of perlrecharclass also suggests other approaches that won’t offend language teachers’ opposition to double-negatives.
Outside locale and Unicode rules or when the /a switch is in effect, “\s matches [\t\n\f\r ] and, starting in Perl v5.18, the vertical tab, \cK.” Discard \r and \n to leave /[\t\f\cK ]/ for matching whitespace but not newline.
If your text is Unicode, use code similar to the sub below to construct a pattern from the table in the aforementioned documentation section.
sub ws_not_nl {
local($_) = <<'EOTable';
0x0009 CHARACTER TABULATION h s
0x000a LINE FEED (LF) vs
0x000b LINE TABULATION vs [1]
0x000c FORM FEED (FF) vs
0x000d CARRIAGE RETURN (CR) vs
0x0020 SPACE h s
0x0085 NEXT LINE (NEL) vs [2]
0x00a0 NO-BREAK SPACE h s [2]
0x1680 OGHAM SPACE MARK h s
0x2000 EN QUAD h s
0x2001 EM QUAD h s
0x2002 EN SPACE h s
0x2003 EM SPACE h s
0x2004 THREE-PER-EM SPACE h s
0x2005 FOUR-PER-EM SPACE h s
0x2006 SIX-PER-EM SPACE h s
0x2007 FIGURE SPACE h s
0x2008 PUNCTUATION SPACE h s
0x2009 THIN SPACE h s
0x200a HAIR SPACE h s
0x2028 LINE SEPARATOR vs
0x2029 PARAGRAPH SEPARATOR vs
0x202f NARROW NO-BREAK SPACE h s
0x205f MEDIUM MATHEMATICAL SPACE h s
0x3000 IDEOGRAPHIC SPACE h s
EOTable
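  my $class;
  # Capture the rest of each table line so that the (LF), (CR), and (NEL)
  # abbreviations are visible to the exclusion test below.
  while (/^0x([0-9a-f]{4})\s+(.+)$/mg) {
    my($hex,$name) = ($1,$2);
    next if $name =~ /\b(?:LF|CR|NEL|SEPARATOR)\b/;
    $class .= "\\N{U+$hex}";
  }
  qr/[$class]/u;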
}
Other Applications
The double-negative trick is also handy for matching alphabetic characters. Remember that \w matches “word characters”: alphabetic characters, digits, and underscore. We ugly-Americans sometimes want to write it as, say,
if (/[A-Za-z]+/) { ... }
but a double-negative character-class can respect the locale:
if (/[^\W\d_]+/) { ... }
Expressing “a word character but not digit or underscore” this way is a bit opaque. A POSIX character-class communicates the intent more directly
if (/[[:alpha:]]+/) { ... }
or with a Unicode property as szbalint suggested
if (/\p{Letter}+/) { ... }
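The double-negative form is also useful in engines that lack both POSIX classes and Unicode properties. Python's built-in re module is one of them, so a sketch there might look like this (the sample string is made up):

```python
import re

# "Word character, but not a digit and not an underscore" = a letter.
LETTERS = re.compile(r"[^\W\d_]+")

print(LETTERS.findall("h\u00e9llo w\u00f6rld_42"))
```

For str patterns, \w is Unicode-aware by default in Python 3, so the accented letters are kept as part of each word.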
Perl versions 5.10 and later support subsidiary vertical and horizontal character classes, \v and \h, as well as the generic whitespace character class \s.
The cleanest solution is to use the horizontal whitespace character class \h. This will match tab and space from the ASCII set, non-breaking space from extended ASCII, or any of these Unicode characters:
U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE (not matched by \s)
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
The vertical space pattern \v is less useful, but matches these characters
U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0085 NEXT LINE (not matched by \s)
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
There are seven vertical whitespace characters which match \v and eighteen horizontal ones which match \h. \s matches twenty-three characters.
All whitespace characters are either vertical or horizontal with no overlap, but they are not proper subsets because \h also matches U+00A0 NO-BREAK SPACE, and \v also matches U+0085 NEXT LINE, neither of which is matched by \s.
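In an engine without \h, the horizontal list above can be turned into an explicit character class. The following Python sketch does that; the constant names are hypothetical, and the code points are simply copied from the list of eighteen characters given for \h.

```python
import re

# The eighteen code points listed above for \h.
HORIZONTAL_CODEPOINTS = (
    [0x0009, 0x0020, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))   # U+2000 through U+200A
    + [0x202F, 0x205F, 0x3000]
)

# Build a character class out of \uXXXX escapes, e.g. [\u0009\u0020...].
HORIZONTAL_WS = re.compile(
    "[" + "".join(f"\\u{cp:04x}" for cp in HORIZONTAL_CODEPOINTS) + "]"
)

print(len(HORIZONTAL_CODEPOINTS))
print(bool(HORIZONTAL_WS.fullmatch("\u00a0")), bool(HORIZONTAL_WS.fullmatch("\n")))
```

Spelling the class out this way keeps the pattern independent of how a given engine defines \s.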
A variation on Greg’s answer that includes carriage returns too:
/[^\S\r\n]/
This regex is safer than /[^\S\n]/ with no \r. My reasoning is that Windows uses \r\n for newlines, and Mac OS 9 used \r. You’re unlikely to find \r without \n nowadays, but if you do find it, it couldn’t mean anything but a newline. Thus, since \r can mean a newline, we should exclude it too.
The regex below matches whitespace characters but not a newline character.
(?:(?!\n)\s)
If you also want to exclude carriage return, put \r alongside \n in a character class inside the negative lookahead.
(?:(?![\n\r])\s)
Add + after the non-capturing group to match one or more white spaces.
(?:(?![\n\r])\s)+
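The lookahead version carries over unchanged to Python's re module, as this minimal sketch with a made-up input string shows:

```python
import re

# At each step the negative lookahead rejects LF and CR,
# and then \s consumes one whitespace character.
WS_RUN = re.compile(r"(?:(?![\n\r])\s)+")

print(WS_RUN.findall("a \t\n b"))
```

The run before the newline and the run after it come back as two separate matches, because the newline itself can never be consumed.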
I don't know why you people failed to mention the POSIX character class [[:blank:]], which matches any horizontal whitespace (spaces and tabs). This POSIX character class works in BRE (Basic Regular Expressions), ERE (Extended Regular Expressions), and PCRE (Perl Compatible Regular Expressions).
What you are looking for is the POSIX blank character class. In Perl it is referenced as:
[[:blank:]]
in Java (don't forget to enable UNICODE_CHARACTER_CLASS):
\p{Blank}
Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).
But the problem is that even sticking to Unicode doesn't solve the issue 100%. Consider the following characters which are not considered whitespace in Unicode:
U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NON-BREAKING SPACE
Taken from https://en.wikipedia.org/wiki/White-space_character
The aforementioned Mongolian vowel separator isn't included for what is probably a good reason. It, along with 200C and 200D, occurs within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. They're more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NON-BREAKING SPACE (if it is used as other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.
In Java:
public static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";
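The "you can tokenize with it" rule can be checked with a quick Python sketch. Python's re has no \p{Blank}, so plain space and tab are used in its place (an assumption); the three extra code points are U+200B, U+2060, and U+FEFF from the list above, and the name TOKEN_SEPARATOR is made up.

```python
import re

# Assumption: space and tab stand in for \p{Blank}; the three extra
# code points are the ones this answer adds by hand.
TOKEN_SEPARATOR = re.compile("[ \t\u200b\u2060\ufeff]+")

print(TOKEN_SEPARATOR.split("foo\u200bbar\u2060baz qux"))
```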
Put the regex below in the find section and select Regular Expression from "Search Mode":
[^\S\r\n]+
Use m/ /g: just put a literal space between the slashes and it will work. Or use \s, which matches all whitespace characters, such as tabs, newlines, and spaces.