regular expression confusion \s and " "

regular expression confusion \s and " " - regex

In regular expression, i know when use \s to represent a space, but, in following case, would they be different:
/a\sb/ ---with a \s
/a b/ ---with empty field
thanks a lot if you can explain to me.

The \s character class matches all "whitespace characters," not just spaces. This includes tabs (\t), and if multiline matching is allowed, it includes carriage return (\r) and newline (\n). Theoretically, if your regular expression engine handles unicode, there are also unicode whitespace characters that \s can match, though your mileage may vary.
So with a string like "a\t b", you can match it with the regex /a\s+b/, in case that is useful to you.

Related

Regex for \n not surrounded with double quotes

How do I match all newline chars \n that are not surrounded by " in regex.
For example:
It should match each of these newlines:
"2547026616"
79587329
,"A","2547026616
"79587329","
But shouldn't match these (except for the last \n):
"254702661679587329"
"A"
"254702661679587329"

This regex might be what you're looking for:
(?:(?<!")\n(?!"))|(?:(?<=")\n(?!"))|(?:(?<!")\n(?="))
This will catch any newline that is not preceded by a quotation mark and/or not followed by a quotation mark, it uses negative and positive look-ahead and look-behind, you can read about them more here. Depending on your programming language these might not be supported, and/or you might need to do some escaping in the regex.
Here is an example on regex101, I've used x\n but this is just to make it easy to display the matches.

Try this:
([^"]\n.?|.?\n[^"])
It matches the 2 characters surrounding the \n as well (if present)
If you want also the last line to end with a double quote, even if it is not followed by \n, then extend the regexp to:
([^"](\n.?|$)|.?\n[^"])
... and if the first line must start with a double quote, then extend to:
([^"](\n.?|$)|(^|.?\n)[^"])

vim: why whitespace character \s not working in bracket expression, but \n do

for example, put /\s in vim will match any whitespace, but /[\s] does not.
I thought it's because backslash escapes are not allowed in bracket expression as stated in https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended, but then I tested /[\n] which correctly match the end-of-line.
Why is this behavior? Is there any document describe this behavior?

Vim uses its own brand of regular expression syntax which is AFAIK not documented on Wikipedia. Like everything Vim, it's documented in Vim itself, :help pattern, and there is a good introduction online too.
Collections contain single characters. Since \s is itself a collection of whitespace characters and not a single character it can't be contained in another collection. If you want to include whitespace characters in a collection, you'll have to include them one-by-one: [ \t].
[\n] works because \n is a single character.
:help pattern
:help /[]
:help /[\n]

Substitution with \s does not work as expected

I write regex to remove more than 1 space in a string. The code is simple:
my $string = 'A string has more than 1 space';
$string = s/\s+/\s/g;
But, the result is something bad: 'Asstringshassmoresthans1sspace'. It replaces every single space with 's' character.
There's a work around is instead of using \s for substitution, I use ' '. So the regex becomes:
$string = s/\s+/ /g;
Why doesn't the regex with \s work?

\s is only a metacharacter in a regular expression (and it matches more than just a space, for example tabs, linebreak and form feed characters), not in a replacement string. Use a simple space (as you already did) if you want to replace all whitespace by a single space:
$string = s/\s+/ /g;
If you only want to affect actual space characters, use
$string = s/ {2,}/ /g;
(no need to replace single spaces with themselves).

The answer to your question is that \s is a character class, not a literal character. Just as \w represents alphanumeric characters, it cannot be used to print an alphanumeric character (except w, which it will print, but that's beside the point).
What I would do, if I wanted to preserve the type of whitespace matched, would be:
s/\s\K\s*//g
The \K (keep) escape sequence will keep the initial whitespace character from being removed, but all subsequent whitespace will be removed. If you do not care about preserving the type of whitespace, the solution already given by Tim is the way to go, i.e.:
s/\s+/ /g

\s stands for matching any whitespace. It's equivalent to this:
[\ \t\r\n\f]
When you replace with $string = s/\s+/\s/g;, you are replacing one or more whitespace characters with the letter s. Here's a link for reference: http://perldoc.perl.org/perlrequick.html

Why doesn't the regex with \s work?
Your regex with \s does work. What doesn't work is your replacement string. And, of course, as others have pointed out, it shouldn't.
People get confused about the substitution operator (s/.../.../). Often I find people think of the whole operator as "a regex". But it's not, it's an operator that takes two arguments (or operands).
The first operand (between the first and second delimiters) is interpreted as a regex. The second operand (between the second and third delimiters) is interpreted as a double-quoted string (of course, the /e option changes that slightly).
So a substitution operation looks like this:
s/REGEX/REPLACEMENT STRING/
The regex recognises special characters like ^ and + and \s. The replacement string doesn't.
If people stopped misunderstanding how the substitution operator is made up, they might stop expecting regex features to work outside of regular expressions :-)

Regular Expression:- String can contain any characters but should not be empty

My requirement is
"A string should not be blank or empty"
Eg., A String can contain any number of characters or strings followed by any special characters but should never be empty for eg., a string can contain "a,b,c" or "xyz123abc" or "12!#$#%&*()9" or " aa bb cc "
So, this is what i tried
Regex for blank or space:-
^\s*$
^ is the beginning of string anchor
$ is the end of string anchor
\s is the whitespace character class
* is zero-or-more repetition of
I'm stuck on how to negate the regex ^\s*$ so that it accepts any string like "a,b,c" or "xyz" or "12!#$#%&*()9"
Any help is appreciated.

No need for a regex. In Groovy you have the isAllWhitespace method:
groovy:000> "".allWhitespace
===> true
groovy:000> " \t\n ".allWhitespace
===> true
groovy:000> "something".allWhitespace
===> false
So asking !yourString.allWhitespace should tell you if your string is something else than empty or blank :)

\S
\S matches any non-white space character
Each character class has it's own anti-class defined, so for \w you have \W for \s you have \S for \d you have \D etc.
http://www.regular-expressions.info/charclass.html
Your regex engine may not support \S. If this is the case you use [^ \t\v] if you support unicode (which you should) there are more space types that you should watch for.
If both your regex engine and you support unicode AND \S is not supported by your regex engine then you'll probably want to use (if you care about people entering different unicode space types):
[^ \r\f\t\v\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u2028\u2029\u202F\u205F\u3000\uFEFF]
http://www.cs.tut.fi/~jkorpela/chars/spaces.html
http://en.wikipedia.org/wiki/Whitespace_character#Unicode

to me two simple ways to express it are (both no need for anchoring):
s.trim() =~ /.+/
or
s =~ /\S+/
the first assumes you know how trim() works, the second assumes the meaning of \S.
Of course
!s.allWhitespace
is perfect, again if you know it exists

The following regular expression will ensure that a string contains at least 1 non-whitespace character.
^(?!\s*$).+
Note: I am not familiar with groovy. But I would imagine there is a native functions (trim, empty, etc) that test this more naturally than a regular expression.

is this in a grails domain class?
if so, just use the blank constraint

Removing repeated characters, including spaces, in one line

I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!

This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (That only works in Perl, AFAIK. In Ruby/Oniguruma, \h matches a hex digit; in every other flavor I know of, it's a syntax error.)

You can try using: -
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # Negates non-whitespace char, \n and \r
print $line
OUTPUT: -
55.25040882,3
[^\S\n\r]|[.,]{2,} -> This means either [^\S\n\r] or [.,]{2,}
[.,]{2,} -> This means replace , or . if there is more than 2 in the same
line.
[^\S\n\r] -> Means negate all whitespace character, linefeed, and newline.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

regular expression confusion \s and " " - regex

In regular expression, i know when use \s to represent a space, but, in following case, would they be different: /a\sb/ ---with a \s /a b/ ---with empty field thanks a lot if you can explain to me.

Related

Regex for \n not surrounded with double quotes

vim: why whitespace character \s not working in bracket expression, but \n do

Substitution with \s does not work as expected

Regular Expression:- String can contain any characters but should not be empty

Removing repeated characters, including spaces, in one line

Categories

Resources