Substitution with \s does not work as expected


I wrote a regex to collapse runs of more than one space in a string. The code is simple:
my $string = 'A string has more than 1 space';
$string =~ s/\s+/\s/g;
But the result is wrong: 'Asstringshassmoresthans1sspace'. It replaces every single space with the character 's'.
A workaround is to use ' ' instead of \s in the replacement, so the substitution becomes:
$string =~ s/\s+/ /g;
Why doesn't the regex with \s work?

\s is only a metacharacter in a regular expression (where it matches more than just a space, for example tabs, line breaks and form feed characters), not in a replacement string. Use a simple space (as you already did) if you want to replace all whitespace by a single space:
$string =~ s/\s+/ /g;
If you only want to affect actual space characters, use
$string =~ s/ {2,}/ /g;
(no need to replace single spaces with themselves).
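To see the difference between the two substitutions above, here is a minimal sketch. The sub names and the sample string are made up for illustration; only the two regexes come from the answer.

```perl
use strict;
use warnings;

# Collapse every run of whitespace (spaces, tabs, newlines) into one space.
sub collapse_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return $text;
}

# Collapse only runs of two or more literal spaces; tabs and newlines survive.
sub collapse_spaces {
    my ($text) = @_;
    $text =~ s/ {2,}/ /g;
    return $text;
}

my $sample = "A  string\thas   more\nthan 1  space";
print collapse_whitespace($sample), "\n";
print collapse_spaces($sample), "\n";
```

The first call turns the tab and the newline into spaces as well, while the second leaves them in place and only shortens the runs of spaces.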

The answer to your question is that \s is a character class, not a literal character. Just as \w represents alphanumeric characters, it cannot be used to print an alphanumeric character (except w, which it will print, but that's beside the point).
What I would do, if I wanted to preserve the type of whitespace matched, would be:
s/\s\K\s*//g
The \K (keep) escape sequence will keep the initial whitespace character from being removed, but all subsequent whitespace will be removed. If you do not care about preserving the type of whitespace, the solution already given by Tim is the way to go, i.e.:
s/\s+/ /g
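A small sketch of the \K variant, assuming Perl 5.10 or later (where \K is available). The sub name and the test string are illustrative only.

```perl
use strict;
use warnings;

# Keep the first whitespace character of each run, drop the rest of the run.
sub squeeze_keep_first {
    my ($text) = @_;
    $text =~ s/\s\K\s*//g;
    return $text;
}

# The tab that starts the first run is kept as a tab; the space that
# starts the second run is kept as a space.
my $squeezed = squeeze_keep_first("a\t  b \n c");
print $squeezed eq "a\tb c" ? "ok\n" : "not ok\n";
```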

\s stands for matching any whitespace character. For ASCII text it is roughly equivalent to this:
[\ \t\r\n\f]
(Since Perl 5.18, \s also matches the vertical tab.)
When you replace with $string =~ s/\s+/\s/g;, you are replacing each run of one or more whitespace characters with the letter s. Here's a link for reference: http://perldoc.perl.org/perlrequick.html

Why doesn't the regex with \s work?
Your regex with \s does work. What doesn't work is your replacement string. And, of course, as others have pointed out, it shouldn't.
People get confused about the substitution operator (s/.../.../). Often I find people think of the whole operator as "a regex". But it's not, it's an operator that takes two arguments (or operands).
The first operand (between the first and second delimiters) is interpreted as a regex. The second operand (between the second and third delimiters) is interpreted as a double-quoted string (of course, the /e option changes that slightly).
So a substitution operation looks like this:
s/REGEX/REPLACEMENT STRING/
The regex recognises special characters like ^ and + and \s. The replacement string doesn't.
If people stopped misunderstanding how the substitution operator is made up, they might stop expecting regex features to work outside of regular expressions :-)
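To make the "two operands" point concrete, here is a rough sketch of what the replacement side does and does not understand. The sub names are hypothetical. Warnings are switched off inside the first sub on the assumption that Perl may otherwise complain about the unrecognized \s escape.

```perl
use strict;
use warnings;

# \s means nothing special in a double-quoted string, so it is just "s".
sub replace_with_backslash_s {
    my ($text) = @_;
    no warnings;    # assumption: avoids a possible "unrecognized escape" warning
    $text =~ s/\s+/\s/g;
    return $text;
}

# \t IS a double-quoted string escape, so this inserts a real tab.
sub replace_with_tab {
    my ($text) = @_;
    $text =~ s/\s+/\t/g;
    return $text;
}

# Variables such as $1 and $2 are interpolated in the replacement.
sub swap_first_two_words {
    my ($text) = @_;
    $text =~ s/(\w+)\s+(\w+)/$2 $1/;
    return $text;
}

print replace_with_backslash_s("a  b c"), "\n";
print swap_first_two_words("hello world again"), "\n";
```

So string features (escapes like \t, variable interpolation) work in the replacement, while regex features (\s, +, ^) do not.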

Related

Eliminate whitespace around single letters

I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with PRE, my feeling is that there must be a very simple and elegant solution for this...

Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
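Wrapped in a sub, the nested substitution can be tried on the sample sentence from the question. This is only a sketch; the sub name is invented, and the regex is the one given above.

```perl
use strict;
use warnings;

# Find each run of single word characters separated by whitespace,
# then strip the spaces from that run with an inner substitution.
sub join_spaced_letters {
    my ($text) = @_;
    $text =~ s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
    return $text;
}

my $input = 'This i s a n example t e x t that c o n t a i n s strange spaces.';
print join_spaced_letters($input), "\n";
# prints: This isan example text that contains strange spaces.
```

Note that $1 is copied into $s before the inner s/ //g runs, since the inner match would otherwise be working on a read-only capture variable.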

Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that i s a n cannot be distinguished from a normal 4-letter word; that requires human correction or some language module.
Explanation:
(?<!\S) negative look-behind assertion checks that the character behind is not a non-whitespace.
(\S) next must follow a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
(?=\S ) next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
Then put back the character we captured with $1
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to do that change if you feel it is appropriate.
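For anyone who wants to try the [^ ] variant mentioned above, a minimal sketch follows. The sub name is made up, and the regex is simply the one-liner with every \S swapped for [^ ].

```perl
use strict;
use warnings;

# Same idea as the one-liner, but only literal spaces count as separators.
sub join_spaced_letters_spaces_only {
    my ($text) = @_;
    # (?<![^ ])  not preceded by a non-space (start of string or a space)
    # ([^ ])     one non-space character, captured
    # " "        the space to drop
    # (?=[^ ] )  next comes another single non-space followed by a space
    $text =~ s/(?<![^ ])([^ ]) (?=[^ ] )/$1/g;
    return $text;
}

my $input = 'This i s a n example t e x t that c o n t a i n s strange spaces.';
print join_spaced_letters_spaces_only($input), "\n";
# prints: This isan example text that contains strange spaces.
```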

perl regex to remove initial all-whitespace lines from a string: why does it work?

The regex s/\A\s*\n// removes every all-whitespace line from the beginning of a string.
It leaves everything else alone, including any whitespace that might begin the first visible line.
By "visible line," I mean a line that satisfies /\S/.
The code below demonstrates this.
But how does it work?
\A anchors the start of the string
\s* greedily grabs all whitespace. But without the (?s) modifier, it should stop at the end of the first line, should it not?
See
https://perldoc.perl.org/perlre.
Suppose that without the (?s) modifier it nevertheless "treats the string as a single line".
Then I would expect the greedy \s* to grab every whitespace character it sees,
including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.
Nevertheless, the code does exactly what I want. Since I can't explain it, it's like a kludge, something that happens to work, discovered through trial and error. What is the reason it works?
#!/usr/bin/env perl
use strict; use warnings;
print $^V; print "\n";
my @strs=(
    join('',"\n", "\t", ' ', "\n", "\t", ' dogs',),
    join('',
        "\n",
        "\n\t\t\x20",
        "\n\t\t\x20",
        '......so what?',
        "\n\t\t\x20",
    ),
);
my $count=0;
for my $onestring(@strs)
{
    $count++;
    print "\n$count ------------------------------------------\n";
    print "|$onestring|\n";
    # Copy the string, then strip leading all-whitespace lines from the copy.
    (my $try1=$onestring)=~s/\A\s*\n//;
    print "|$try1|\n";
}

But how does it work?
...
I would expect the greedy \s* to grab every whitespace character it sees, including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.
Correct -- the \s* at first grabs everything up to the d (in dogs), and with that the match would fail ... so it backs up, a character at a time, shortening that greedy grab so as to give the following pattern, here \n, a chance to match.
And that works! So \s* matches up to (the last!) \n, that one is matched by the following \n in the pattern, and all is well. That's removed and we are left with "\t dogs", which is printed.
This is called backtracking. See more about it in perlretut. Backtracking can be suppressed, most notably by possessive forms (like \w++ etc.), or by the extended construct (?>...).
But without the (?s) modifier, it should stop at the end of the first line, should it not?
Here you may be confusing \s with ., which indeed does not match \n (without /s).
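One way to watch the backtracking at work is to compare the greedy \s* with its possessive form \s*+, which refuses to give anything back. This is a sketch assuming Perl 5.10 or later (for possessive quantifiers); the sub names are invented.

```perl
use strict;
use warnings;

# Greedy \s* can backtrack, so the final \n in the pattern finds its match.
sub strip_leading_blank_lines {
    my ($text) = @_;
    $text =~ s/\A\s*\n//;
    return $text;
}

# Possessive \s*+ swallows all leading whitespace, including every newline,
# and never gives any back, so the \n that follows can never match.
sub strip_possessive {
    my ($text) = @_;
    $text =~ s/\A\s*+\n//;
    return $text;
}

my $str = "\n\t \n\t dogs";
print strip_leading_blank_lines($str) eq "\t dogs" ? "greedy: stripped\n" : "greedy: unchanged\n";
print strip_possessive($str) eq $str ? "possessive: unchanged\n" : "possessive: stripped\n";
```

The possessive version leaves the string untouched, which is the "we would never get a match" behaviour the question expected from the greedy one.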

There are two questions here.
The first is about the interaction of \s and (lack of) (?s). Quite simply, there is no interaction.
\s matches whitespace characters, which include Line Feed (LF). It's not affected by (?s) whatsoever.
(?s) exclusively affects the . metacharacter:
(?-s) causes . to match all characters except LF. [Default]
(?s) causes . to match all characters.
If one wanted to match whitespace on the current line, one could use \h instead of \s. It only matches horizontal whitespace, thus excluding CR and LF (among others).
Alternatively, (?[ \s - \n ])[1], [^\S\n][2] and \s(?<!\n)[3] all match whitespace characters other than LF.
The second is about a misconception of what greediness means.
Greediness or lack thereof doesn't affect if a pattern can match, just what it matches. For example, for a given input, /a+/ and /a+?/ will both match, or neither will match. It's impossible for one to match and not the other.
"aaaa" =~ /a+/ # Matches 4 characters at position 0.
"aaaa" =~ /a+?/ # Matches 1 character at position 0.
"bbbb" =~ /a+/ # Doesn't match.
"bbbb" =~ /a+?/ # Doesn't match.
When something is greedy, it means it will match the most possible at the current position that allows the entire pattern to match. Take the following for example:
"ccccd" =~ /.*d/
This pattern can match by having .* match only cccc instead of ccccd, and thus does so. This is achieved through backtracking. .* initially matches ccccd, then it discovers that d doesn't match, so .* tries matching only cccc. This allows the d and thus the entire pattern to match.
You'll find backtracking used outside of greediness too. "efg" =~ /^(e|.f)g/ matches because it tries the second alternative when it's unable to match g when using the first alternative.
In the same way as .* avoids matching the d in the earlier example, the \s* avoids matching the LF and tab before dogs in your example.
[1] Requires use experimental qw( regex_sets ); before 5.36, but it was safe to use since 5.18 as it was accepted without change since its introduction as an experimental feature.
[2] Less clear because it uses double negatives.
[^\S\n]
= A char that's ( not( not(\s) or LF ) )
= A char that's ( not(not(\s)) and not(LF) )
= A char that's ( \s and not LF )
[3] Less efficient, and far from as pretty as the regex set.
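As a rough illustration of the alternatives to \s listed above, the sketch below trims trailing whitespace on each line without touching the line feeds. It assumes Perl 5.10 or later for \h; the sub names and sample text are made up.

```perl
use strict;
use warnings;

# \h matches horizontal whitespace only, so the newlines are left alone.
sub trim_line_ends_h {
    my ($text) = @_;
    $text =~ s/\h+$//mg;
    return $text;
}

# [^\S\n] is "whitespace, but not LF": same result on this input.
sub trim_line_ends_negated {
    my ($text) = @_;
    $text =~ s/[^\S\n]+$//mg;
    return $text;
}

my $text = "a \t\nb  \nc";
print trim_line_ends_h($text) eq "a\nb\nc" ? "ok\n" : "not ok\n";
print trim_line_ends_negated($text) eq "a\nb\nc" ? "ok\n" : "not ok\n";
```

The two are not identical in general: [^\S\n] also matches CR and form feed, which \h does not.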

$cmd =~ s#-fp [^ ]+##; What does it mean in Perl?

$cmd =~ s#-fp [^ ]+##;
Is there anyone who let me know what this regex means in Perl?
I couldn't find any regex like above through googling...

This removes the -fp optional parameter and its value from the command.
This takes the string stored by variable $cmd and replaces a section matching -fp [^ ]+ with nothing.
This command relies on the fact that Perl substitution (and the other regex operators) can use almost any delimiter character. What is normally written as s/.../.../ is s#...#...# here. That may be the source of confusion.
=~ is a binary binding operator which takes the left argument as the string to perform the right argument on, in this case a substitution.
-fp [^ ]+
-fp matches literally.
[^ ]+ matches one or more characters which are not space.

Let's get the easy bit out of the way first. The $cmd =~ simply means "do the substitution on the variable $cmd".
Not all of this expression is a regex. It's actually the substitution operator - s/REGEX/STRING/. It matches the REGEX and replaces it with the STRING.
Like many similar operators in Perl, the substitution operator allows you to choose the delimiter character that you use. In this case, the programmer has made the slightly bizarre choice to use #.
So, we have this:
$cmd =~ s/-fp [^ ]+//;
And we now know what it means: "Match the variable $cmd against the regex -fp [^ ]+ and replace the match with an empty string". Why an empty string? Because the replacement string (between the second and third /) is empty.
All we need to do now is to understand the actual regex - -fp [^ ]+. And it's not very complicated.
-fp - the first four characters (up to and including the space) match themselves. So this matches the literal string "-fp ".
[^ ] - this is a "character class". Normally, it means "match any of the characters inside [...]". But the ^ at the start inverts that meaning to "match any character except the ones inside [^...]". So this matches anything that isn't a space.
+ - this is a quantifier that means "match one or more of the previous expression".
So, put together, this is "match the string '-fp ' followed by one or more non-space characters".
And, adding in the rest of the expression, we get:
Look at the string in $cmd; if you find the string '-fp ' followed by one or more non-space characters, then replace the matched portion with an empty string.
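A quick way to convince yourself is to run the substitution on a sample command line. The command string below is hypothetical, as are the sub names; the brace-delimited form is just another spelling of the same operator.

```perl
use strict;
use warnings;

# The original form, with # as the delimiter.
sub remove_fp_hash {
    my ($cmd) = @_;
    $cmd =~ s#-fp [^ ]+##;
    return $cmd;
}

# The same substitution written with braces as delimiters.
sub remove_fp_braces {
    my ($cmd) = @_;
    $cmd =~ s{-fp [^ ]+}{};
    return $cmd;
}

my $cmd = 'prog -fp /tmp/file.txt -v';    # hypothetical command line
print '|', remove_fp_hash($cmd), "|\n";
print '|', remove_fp_braces($cmd), "|\n";
```

Both leave 'prog  -v' behind. The space before -fp and the one after its value are not part of the match, so two spaces remain.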

What do these qr{} regular expressions mean?

What do these mean?
qr{^\Q$1\E[a-zA-Z0-9_\-]*\Q$2\E$}i
qr{^[a-zA-Z0-9_\-]*\Q$1\E$}i
If $pattern is a Perl regular expression, what is $identity in the code below?
$identity =~ $pattern;

When the RHS of =~ isn't m//, s/// or tr///, a match operator (m//) is implied.
$identity =~ $pattern;
is the same as
$identity =~ /$pattern/;
It matches the pattern or pre-compiled regex $pattern (qr//) against the value of $identity.
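A minimal sketch of the implied match, using a made-up pattern and a hypothetical sub name. The right-hand side can be a qr// object or a plain string that gets compiled as a regex.

```perl
use strict;
use warnings;

# $identity =~ $pattern behaves like $identity =~ /$pattern/.
sub identity_matches {
    my ($identity, $pattern) = @_;
    return ($identity =~ $pattern) ? 1 : 0;
}

my $pattern = qr{^[a-z]+\d*$}i;    # illustrative pattern, not from the question
print identity_matches('Abc123', $pattern), "\n";    # 1
print identity_matches('123abc', $pattern), "\n";    # 0
print identity_matches('hello', 'ell'), "\n";        # 1: a string is used as a regex too
```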

The binding operator =~ applies a regex to a string variable. This is documented in perldoc perlop.
The \Q ... \E escape sequence is a way to quote meta characters (also documented in perlop). It allows for variable interpolation, though, which is why you can use it here with $1 and $2. However, using those variables inside a regex is somewhat iffy, because they themselves are defined during the use of a capture inside a regex.
The character class bracket [ ... ] defines a range of characters which it will match. The quantifier that follows it * means that particular bracket must match zero or more times. The dashes denote ranges, such as a-z meaning "from a through z". The escaped dash \- means a literal dash.
The ^ and $ (the dollar sign at the end) denote anchors: beginning and end of string, respectively. The modifier i at the end means the match is case insensitive.
In your example, $identity is a variable that presumably contains a string (or whatever it contains will be converted to a string).

The perlre documentation is your friend here. Search it for unfamiliar regex constructs.
A detailed explanation is below, but it is so hairy that I wonder whether using a module such as Text::Balanced would be a superior approach.
The first pattern matches possibly empty delimited strings, and the delimiters are in $1 and $2, which we do not know until runtime. Say $1 is ( and $2 is ), then the first pattern matches strings of the form
()
(a)
(9)
(abcABC_012-)
and so on …
The second pattern matches terminated strings, where the terminator is in $1—also not known until runtime. Assuming the terminator is ], then the second pattern matches strings of the form
]
a]
Aa9a_9]
Using \Q...\E around a pattern removes any special regex meaning from the characters inside, as documented in perlop:
For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and @). For example, the following matches:
'\s\t' =~ /\Q\s\t/
Because $ or @ trigger interpolation, you'll need to use something like /\Quser\E\@\Qhost/ to match them literally.
The patterns in your question do want to trigger interpolation but do not want any regex metacharacters to have special meaning, as with parentheses and square brackets above that are meant to match literally.
Other parts:
Square brackets delimit a character class. For example, [a-zA-Z0-9_\-] matches any single character that is upper- or lowercase A through Z (but with no accents or other extras), zero through nine, underscore, or hyphen. Note that the hyphen is escaped at the end to emphasize that it matches a literal hyphen rather than specifying part of a range.
The * quantifier means match zero or more of the preceding subpattern. In the examples from your question, the star repeats character classes.
The patterns are bracketed with ^ and $, which means an entire string must match rather than some substring to succeed.
The i at the end, after the closing curly brace, is a regex switch that makes the pattern case-insensitive. As TLP helpfully points out in the comment below, this makes the delimiters or terminators match without regard to case if they contain letters.
The expression $identity =~ $pattern tests whether the compiled regex stored in $pattern (created with $pattern = qr{...}) matches the text in $identity. As written above, it is likely being evaluated for its side effect of storing capture groups in $1, $2, etc. This is a red flag. Never use $1 and friends unconditionally but instead write
if ($identity =~ $pattern) {
print $1, "\n"; # for example
}
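To experiment with the first pattern, the delimiters can be passed in as ordinary variables instead of $1 and $2. This is a sketch under that assumption; the sub name and the test strings are invented.

```perl
use strict;
use warnings;

# Build the "delimited string" pattern from two literal delimiters.
# \Q...\E keeps characters like ( ) [ ] from acting as regex syntax.
sub delimited_pattern {
    my ($open, $close) = @_;
    return qr{^\Q$open\E[a-zA-Z0-9_\-]*\Q$close\E$}i;
}

my $parens = delimited_pattern('(', ')');
for my $candidate ('()', '(abcABC_012-)', '(a b)', 'abc') {
    if ($candidate =~ $parens) {
        print "$candidate matches\n";
    }
    else {
        print "$candidate does not match\n";
    }
}
```

The first two candidates match. '(a b)' fails because a space is not in the character class, and 'abc' fails because the delimiters are missing.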

Removing repeated characters, including spaces, in one line

I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!

This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (That only works in Perl, AFAIK. In Ruby/Oniguruma, \h matches a hex digit; in every other flavor I know of, it's a syntax error.)
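Here is that regex applied to the string from the question, as a small sketch. The sub name is made up, and \h assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Remove runs of a repeated period or comma, plus all horizontal whitespace.
sub clean_line {
    my ($line) = @_;
    $line =~ s/([.,])\1+|\h+//g;
    return $line;
}

print clean_line('55.25040882, 3,,,,,,'), "\n";
# prints: 55.25040882,3
```

The single comma after the number survives because ([.,])\1+ needs at least two identical characters in a row, and a trailing newline would survive because \h does not match it.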

You can try using:
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # Removes whitespace other than \n and \r, plus runs of . or ,
print $line;
OUTPUT:
55.25040882,3
[^\S\n\r]|[.,]{2,} -> This means either [^\S\n\r] or [.,]{2,}
[.,]{2,} -> This matches a run of two or more consecutive , or . characters.
[^\S\n\r] -> This matches any whitespace character except linefeed (\n) and carriage return (\r).
