Vim regex matches unicode characters are as non-word

Vim regex matches unicode characters are as non-word - regex

I have the following text:
üyü
The following regex search matches the characters ü:
/\W
Is there a unicode flag in Vim regex?

Unfortunately, there is no such flag (yet).
Some built-in character classes (can) include multi-byte characters,
others don't. The common \w \a \l \u classes only contain ASCII
letters, so even umlaut characters aren't included in them, leading to
unexpected behavior! See also https://unix.stackexchange.com/a/60600/18876.
In the 'isprint' option (and 'iskeyword', which determines what motions like w move over), multi-byte characters 256 and
above are always included, only extended ASCII characters up to 255 are specified with
this option.

I always use:
ASCII UTF-8
----- -----
\w [a-zA-Z\u0100-\uFFFF]
\W [^a-zA-Z\u0100-\uFFFF]

You can use \%uXXXX to match a multibyte character. In that case…
/\%u00fc
But I'm not aware of a flag that would make the whole matching multibyte-friendly.
Note that with the default value of iskeyword on UNIX systems, ü is matched by \k.

very often I find \S+ takes me where I want to go. i.e:
s/\(\S\+\)\s\+\(\S\+\).*/\1 | \2/ selects "wörd1 w€rd2 but not word3" and replaces the line with "wörd1 | w€rd2"

Related

Regular Expression Remove all English and Arabic numbers?

I want Regular Expression to remove Arabic and english numbers
my varibale is
$variable="12121212ABDHSتشؤآئ۳۳۴۳۴729384234owiswoisw";
i want remove all digits ! LIKE:
ABDHSتشؤآئowiswoisw
I found the following expression but not work !
$newvariable = preg_replace('/^[\u0621-\u064A]+$', '', $variable);
thanks for you helps

You may use
$newvariable = preg_replace('/\d+/u', '', $variable);
See the regex demo
The \d matches ASCII digits by default, but when you add the u modifier, it enables the PCRE_UCP option (together with PCRE_UTF8) that enables \d to match all Unicode digits.
See PCRE documentation:
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII
characters are recognized, but if PCRE_UCP is set, Unicode properties
are used instead to classify characters.
You may fix your regex if you need to only restrict matching to ASCII and those of your choice:
preg_replace('/[0-9\u0621-\u064A]+/u', '', $variable)

How to remove all non word characters except ä,ö and ü from a text using RegExp

I have a file and I want to remove all non-word characters from it, with the exception of ä, ö and ü, which are mutated vowels in the German language. Is there a way to do word.gsub!(/\W/, '') and put exceptions in it?
Example:
text = "übung bzw. äffchen"
text.gsub!(/\W/, '').
Now it would return "bungbzwffchen". It deletes the non word characters, but also removes the mutated vowels ü and ä, which I want to keep.

You may be able to define a list of exclusions by using some kind of negative-lookback thing, but the simplest I think would be to just use \w instead of \W and negate the whole group:
word.gsub!(/[^\wÄäÖöÜü]/, '')
You could also use word.gsub(/[^\p{Letter}]/, ''), that should get rid of any characters that are not listed as "Letter" in unicode.
You mention German vowels in your question, I think it's worth noting here that the German alphabet also includes the long-s : ẞ / ß
Update:
To answer your original question, to define a list of exclusions, you use the "negative look-behind" (?<!pat):
word.gsub(/\W(?<![ÄäÖöÅåẞß])/, '')

You could use the && operator within a character class:
text = "übung bzw. äffchen ÄÖÜ"
text.gsub(/[\W&&[^äÄöÖüÜ]]/, '')
#=> "übungbzwäffchenÄÖÜ"
The regular expression reads, "match a character in the set of characters formed by intersecting the set of all non-word characters with the set of all characters other than those in the string "äÄöÖüÜ". See Regexp (search for "&& operator").

In postgreSQL, why is \s treated differently from \w?

Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!

I think you have tested it the wrong way because I'm getting the opposite results that you got.
select ' w' ~ '^\s\w$';
Is returning 1 in my case. Which actually makes sense because it is matching the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
Is returning 0 and it makes sense too. Here you're trying to match a backslash at the beginning of the text followed by an s and then, by any number of letters, numbers or underscores.
A piece of text that would match your second regex would be: '\sw'
Check the fiddle here.

The string constants are first parsed and interpreted as strings, including escaped characters. Escaping of unrecognized sequences is handled differently by different parsers, but generally, besides errors, the most common behavior is to ignore the backslash.
In the first example, the right-hand string constant is first being interpreted as '^sw$', where both \s and \w are not recognized string escape sequences.
In the second example the right hand constant is interpreted as '^\sw*$' where \\s escapes the \
After the strings are interpreted they are then applied as a regular expression, '^\sw*$' matching ' w' where '^sw$' does not.

Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some rare and odd dialects of SQL do that. PostgresSQL does it. PostgresSQL is translating the backslash escaping to arrive at a string value, and then feeding that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation -- if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgresSQL, an invalid escape sequence in a string literal translates as the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. It's not matching that w in the lvalue because it's a word character; it's matching it because it's a lowercase w. Change it to lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want to match any word character with that one.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the #"string literal" feature to disable backslash escaping in strings they intend to use as regexes.

Vim regex not matching spaces in a character class

I'm using vim to do a search and replace with this command:
%s/lambda\s*{\([\n\s\S]\)*//gc
I'm trying to match for all word, endline and whitespace characters after a {. For instance, the entirety of this line should match:
lambda {
FactoryGirl.create ...
Instead, it only matches up to the newline and no spaces before FactoryGirl. I've tried manually replacing all the spaces before, just in case there were tab characters instead, but no dice. Can anyone explain why this doesn't work?

The \s is an atom for whitespace; \n, though it looks similar, syntactically is an escape sequence for a newline character. Inside the collection atom [...], you cannot include other atoms, only characters (including some special ones like \n. From :help /[]:
The following translations are accepted when the 'l' flag is not
included in 'cpoptions' {not in Vi}:
\e <Esc>
\t <Tab>
\r <CR> (NOT end-of-line!)
\b <BS>
\n line break, see above |/[\n]|
\d123 decimal number of character
\o40 octal number of character up to 0377
\x20 hexadecimal number of character up to 0xff
\u20AC hex. number of multibyte character up to 0xffff
\U1234 hex. number of multibyte character up to 0xffffffff
NOTE: The other backslash codes mentioned above do not work inside
[]!
So, either specify the whitespace characters literally [ \t\n...], use the corresponding character class expression [[:space:]...], or combine the atom with the collection via logical or \%(\s\|[...]\).

Vim interprets characters inside of the [ ... ] character classes differently. It's not literally, since that regex wouldn't fully match lambda {sss or lambda {\\\. What \s and \S are interpreted as...I still can't explain.
However, I was able to achieve nearly what I wanted with:
%s/lambda\s*{\([\n a-zA-z]\)*//gc
That ignores punctuation, which I wanted. This works, but is dangerous:
%s/lambda\s*{\([\n a-zA-z]\|.\)*//gc
Because adding on a character after the last character like } causes vim to hang while globbing. So my solution was to add the punctuation I needed into the character class.

Regular Expression WhiteSpace Character that Excludes Newline [duplicate]

I sometimes want to match whitespace but not newline.
So far I've been resorting to [ \t]. Is there a less awkward way?

Use a double-negative:
/[^\S\r\n]/
That is, not-not-whitespace (the capital S complements) or not-carriage-return or not-newline. Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law, this is equivalent to “whitespace but not carriage return or newline.” Including both \r and \n in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and DOS-ish (CR LF) newline conventions.
No need to take my word for it:
#! /usr/bin/env perl
use strict;
use warnings;
use 5.005; # for qr//
my $ws_not_crlf = qr/[^\S\r\n]/;
for (' ', '\f', '\t', '\r', '\n') {
my $qq = qq["$_"];
printf "%-4s => %s\n", $qq,
(eval $qq) =~ $ws_not_crlf ? "match" : "no match";
}
Output:
" " => match
"\f" => match
"\t" => match
"\r" => no match
"\n" => no match
Note the exclusion of vertical tab, but this is addressed in v5.18.
Before objecting too harshly, the Perl documentation uses the same technique. A footnote in the “Whitespace” section of perlrecharclass reads
Prior to Perl v5.18, \s did not match the vertical tab. [^\S\cK] (obscurely) matches what \s traditionally did.
The same section of perlrecharclass also suggests other approaches that won’t offend language teachers’ opposition to double-negatives.
Outside locale and Unicode rules or when the /a switch is in effect, “\s matches [\t\n\f\r ] and, starting in Perl v5.18, the vertical tab, \cK.” Discard \r and \n to leave /[\t\f\cK ]/ for matching whitespace but not newline.
If your text is Unicode, use code similar to the sub below to construct a pattern from the table in the aforementioned documentation section.
sub ws_not_nl {
local($_) = <<'EOTable';
0x0009 CHARACTER TABULATION h s
0x000a LINE FEED (LF) vs
0x000b LINE TABULATION vs [1]
0x000c FORM FEED (FF) vs
0x000d CARRIAGE RETURN (CR) vs
0x0020 SPACE h s
0x0085 NEXT LINE (NEL) vs [2]
0x00a0 NO-BREAK SPACE h s [2]
0x1680 OGHAM SPACE MARK h s
0x2000 EN QUAD h s
0x2001 EM QUAD h s
0x2002 EN SPACE h s
0x2003 EM SPACE h s
0x2004 THREE-PER-EM SPACE h s
0x2005 FOUR-PER-EM SPACE h s
0x2006 SIX-PER-EM SPACE h s
0x2007 FIGURE SPACE h s
0x2008 PUNCTUATION SPACE h s
0x2009 THIN SPACE h s
0x200a HAIR SPACE h s
0x2028 LINE SEPARATOR vs
0x2029 PARAGRAPH SEPARATOR vs
0x202f NARROW NO-BREAK SPACE h s
0x205f MEDIUM MATHEMATICAL SPACE h s
0x3000 IDEOGRAPHIC SPACE h s
EOTable
my $class;
while (/^0x([0-9a-f]{4})\s+([A-Z\s]+)/mg) {
my($hex,$name) = ($1,$2);
next if $name =~ /\b(?:CR|NL|NEL|SEPARATOR)\b/;
$class .= "\\N{U+$hex}";
}
qr/[$class]/u;
}
Other Applications
The double-negative trick is also handy for matching alphabetic characters too. Remember that \w matches “word characters,” alphabetic characters and digits and underscore. We ugly-Americans sometimes want to write it as, say,
if (/[A-Za-z]+/) { ... }
but a double-negative character-class can respect the locale:
if (/[^\W\d_]+/) { ... }
Expressing “a word character but not digit or underscore” this way is a bit opaque. A POSIX character-class communicates the intent more directly
if (/[[:alpha:]]+/) { ... }
or with a Unicode property as szbalint suggested
if (/\p{Letter}+/) { ... }

Perl versions 5.10 and later support subsidiary vertical and horizontal character classes, \v and \h, as well as the generic whitespace character class \s
The cleanest solution is to use the horizontal whitespace character class \h. This will match tab and space from the ASCII set, non-breaking space from extended ASCII, or any of these Unicode characters
U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE (not matched by \s)
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
The vertical space pattern \v is less useful, but matches these characters
U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0085 NEXT LINE (not matched by \s)
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
There are seven vertical whitespace characters which match \v and eighteen horizontal ones which match \h. \s matches twenty-three characters
All whitespace characters are either vertical or horizontal with no overlap, but they are not proper subsets because \h also matches U+00A0 NO-BREAK SPACE, and \v also matches U+0085 NEXT LINE, neither of which are matched by \s

A variation on Greg’s answer that includes carriage returns too:
/[^\S\r\n]/
This regex is safer than /[^\S\n]/ with no \r. My reasoning is that Windows uses \r\n for newlines, and Mac OS 9 used \r. You’re unlikely to find \r without \n nowadays, but if you do find it, it couldn’t mean anything but a newline. Thus, since \r can mean a newline, we should exclude it too.

The below regex would match white spaces but not of a new line character.
(?:(?!\n)\s)
DEMO
If you want to add carriage return also then add \r with the | operator inside the negative lookahead.
(?:(?![\n\r])\s)
DEMO
Add + after the non-capturing group to match one or more white spaces.
(?:(?![\n\r])\s)+
DEMO
I don't know why you people failed to mention the POSIX character class [[:blank:]] which matches any horizontal whitespaces (spaces and tabs). This POSIX chracter class would work on BRE(Basic REgular Expressions), ERE(Extended Regular Expression), PCRE(Perl Compatible Regular Expression).
DEMO

What you are looking for is the POSIX blank character class. In Perl it is referenced as:
[[:blank:]]
in Java (don't forget to enable UNICODE_CHARACTER_CLASS):
\p{Blank}
Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).
But the problem is that even sticking to Unicode doesn't solve the issue 100%. Consider the following characters which are not considered whitespace in Unicode:
U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NON-BREAKING SPACE
Taken from https://en.wikipedia.org/wiki/White-space_character
The aforementioned Mongolian vowel separator isn't included for what is probably a good reason. It, along with 200C and 200D, occur within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. They're more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NON-BREAKING SPACE (if it used as other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.
In Java:
static public final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFFEF]"

Put the regex below in the find section and select Regular Expression from "Search Mode":
[^\S\r\n]+

m/ /g just give space in / /, and it will work. Or use \S — it will replace all the special characters like tab, newlines, spaces, and so on.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js