Trim whitespace from middle of string - regex

I'm using the following regex to capture a fixed width "description" field that is always 50 characters long:
(?<description>.{50})
My problem is that the descriptions sometimes contain a lot of whitespace, e.g.
"FLUID COMPRESSOR "
Can somebody provide a regex that:
Trims all whitespace off the end
Collapses any whitespace in between words to a single space

Substitute two or more spaces for one space:
s/  +/ /g
Edit: for any white space (not just spaces) you can use \s if you're using a perl-compatible regex library, and the curly brace syntax for number of occurrences, e.g.
s/\s\s+/ /g
or
s/\s{2,}/ /g
Edit #2: forgot the /g global suffix, thanks JL

str = Regex.Replace(str, " +( |$)", "$1");

Perl-variants:
1) s/\s+$//;
2) s/\s+/ /g;

C#:
Use this only if you want to normalize all the whitespace: at the start, at the end, and in the middle.
x = Regex.Replace(x, @"\s+", " ").Trim();

Is there a particular reason you are asking for a regular expression? They may not be the best tool for this task.
A replacement like
s/[ \t]+/ /g
should compress the internal whitespace (actually, it will compress leading and trailing whitespace too, but it doesn't sound like that is a problem), and
s/[ \t]+$//
will take care of the trailing whitespace. [I'm using sedish syntax here. You didn't say what flavor you prefer.]
Right off hand I don't see a way to do it in a single expression.

Since compressing whitespace and trimming whitespace around the edges are conceptually different operations, I like doing it in two steps:
re.sub(r"\s+", " ", s.strip())
Not the most efficient, but quite readable.
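A minimal Python sketch of the two-step approach, assuming the 50-character field has already been cut out of the record. The helper name normalize_description and the sample padding are made up for illustration.

```python
import re

def normalize_description(field: str) -> str:
    """Trim both ends, then collapse each inner whitespace run to one space."""
    trimmed = field.strip()
    return re.sub(r"\s+", " ", trimmed)

# Illustrative fixed-width field, padded with spaces to 50 characters.
raw = "FLUID    COMPRESSOR".ljust(50)
print(repr(normalize_description(raw)))
```

Stripping first keeps the regex simple: it only ever sees whitespace that sits between words.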

/(^[\s\t]+|[\s\t]+([\s\t]|$))/g replace with $2 (beginning|middle/end)

Related

Eliminate whitespace around single letters

I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with Perl regular expressions, my feeling is that there must be a very simple and elegant solution for this...
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that "i s a n" cannot be distinguished from a normal four-letter word that has been spaced out; fixing that requires human correction or some language module.
Explanation:
(?<!\S) negative look-behind assertion checks that the character behind is not a non-whitespace.
(\S) next must follow a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
(?=\S ) next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
Then put back the character we captured with $1
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to do that change if you feel it is appropriate.
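The same look-around substitution can be sketched in Python, whose re module accepts this fixed-width negative look-behind. The helper name join_spaced_letters is hypothetical; the pattern is the one explained above.

```python
import re

# (?<!\S)  : nothing but whitespace (or start of string) before the character
# (\S)     : the standalone character we keep
# " "      : the single space we drop
# (?=\S )  : another standalone character follows (checked, not consumed)
SINGLE_LETTER_GAP = re.compile(r"(?<!\S)(\S) (?=\S )")

def join_spaced_letters(text: str) -> str:
    return SINGLE_LETTER_GAP.sub(r"\1", text)

sample = "This i s a n example t e x t that c o n t a i n s strange spaces."
print(join_spaced_letters(sample))
```

Because the look-ahead does not consume the next letter, that letter is still available as the start of the next match, which is what the original two-capture attempt was missing.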

Substitution with \s does not work as expected

I write regex to remove more than 1 space in a string. The code is simple:
my $string = 'A string has more than 1 space';
$string =~ s/\s+/\s/g;
But, the result is something bad: 'Asstringshassmoresthans1sspace'. It replaces every single space with 's' character.
A workaround is to use ' ' instead of \s in the replacement. So the substitution becomes:
$string =~ s/\s+/ /g;
Why doesn't the regex with \s work?
\s is only a metacharacter in a regular expression (and it matches more than just a space, for example tabs, linebreak and form feed characters), not in a replacement string. Use a simple space (as you already did) if you want to replace all whitespace by a single space:
$string =~ s/\s+/ /g;
If you only want to affect actual space characters, use
$string =~ s/ {2,}/ /g;
(no need to replace single spaces with themselves).
The answer to your question is that \s is a character class, not a literal character. Just as \w represents alphanumeric characters, it cannot be used to print an alphanumeric character (except w, which it will print, but that's beside the point).
What I would do, if I wanted to preserve the type of whitespace matched, would be:
s/\s\K\s*//g
The \K (keep) escape sequence will keep the initial whitespace character from being removed, but all subsequent whitespace will be removed. If you do not care about preserving the type of whitespace, the solution already given by Tim is the way to go, i.e.:
s/\s+/ /g
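For comparison, here is a rough Python sketch of both variants. Python's built-in re module has no \K, so this version captures the first whitespace character and puts it back instead; the function names are made up for illustration.

```python
import re

def squeeze_keep_first(text: str) -> str:
    # (\s) captures the first whitespace character of each run and
    # \s* swallows the rest; only the captured character is put back.
    return re.sub(r"(\s)\s*", r"\1", text)

def squeeze_to_space(text: str) -> str:
    # Every whitespace run, whatever it contains, becomes one plain space.
    return re.sub(r"\s+", " ", text)

sample = "A\t\t string  has\n\nmore"
print(repr(squeeze_keep_first(sample)))
print(repr(squeeze_to_space(sample)))
```

In both functions the second argument is an ordinary replacement string, not a pattern, which is the distinction this thread is about.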
\s stands for matching any whitespace. It's equivalent to this:
[\ \t\r\n\f]
When you replace with $string =~ s/\s+/\s/g;, you are replacing one or more whitespace characters with the letter s. Here's a link for reference: http://perldoc.perl.org/perlrequick.html
Why doesn't the regex with \s work?
Your regex with \s does work. What doesn't work is your replacement string. And, of course, as others have pointed out, it shouldn't.
People get confused about the substitution operator (s/.../.../). Often I find people think of the whole operator as "a regex". But it's not, it's an operator that takes two arguments (or operands).
The first operand (between the first and second delimiters) is interpreted as a regex. The second operand (between the second and third delimiters) is interpreted as a double-quoted string (of course, the /e option changes that slightly).
So a substitution operation looks like this:
s/REGEX/REPLACEMENT STRING/
The regex recognises special characters like ^ and + and \s. The replacement string doesn't.
If people stopped misunderstanding how the substitution operator is made up, they might stop expecting regex features to work outside of regular expressions :-)

Regex to change the number of spaces in an indent level

Let's say you have some lines that look like this
int some_function() {
  int x = 3;  // Some silly comment
And so on. The indentation is done with spaces, and each indent is two spaces.
You want to change each indent to be three spaces. The simple regex
s/ {2}/   /g
Doesn't work for you, because that changes some non-indent spaces; in this case it changes the two spaces before // Some silly comment into three spaces, which is not desired. (This gets far worse if there are tables or comments aligned at the back end of the line.)
You can't simply use
/^( {2})+/
Because what would you replace it with? I don't know of an easy way to find out how many times a + was matched in a regex, so we have no idea how many altered indents to insert.
You could always go line-by-line and cut off the indents, measure them, build a new indent string, and tack it onto the line, but it would be oh so much simpler if there was a regex.
Is there a regular expression to replace indent levels as described above?
In some regex flavors, you can use a lookbehind:
s/(?<=^ *)  /   /g
In all other flavors, you can reverse the string, use a lookahead (which all flavors support) and reverse again:
s/  (?= *$)/   /g
Here's another one, instead utilizing \G, which is supported by .NET, PCRE (C, PHP, R…), Java, Perl and Ruby:
s/(^|\G) {2}/   /g
\G [...] can match at one of two positions:
✽ The beginning of the string,
✽ The position that immediately follows the end of the previous match.
Source: http://www.rexegg.com/regex-anchors.html#G
We utilize its ability to match at the position that immediately follows the end of the previous match, which in this case will be at the start of a line, followed by 2 whitespaces (OR a previous match following the aforementioned rule).
See example: https://regex101.com/r/qY6dS0/1
I needed to halve the amount of spaces on indentation. That is, if indentation was 4 spaces, I needed to change it to 2 spaces.
I couldn't come up with a regex. But, thankfully, someone else did:
//search for
^( +)\1
//replace with (or \1, in some programs, like geany)
$1
From source: "^( +)\1 means 'any nonzero-length sequence of spaces at the start of the line, followed by the same sequence of spaces'. The \1 in the pattern, and the $1 in the replacement, are both back-references to the initial sequence of spaces. Result: indentation halved."
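The halving trick can be tried out in Python as a quick sketch. The function name halve_indent and the sample text are illustrative; re.MULTILINE is needed so that ^ matches at the start of every line rather than only at the start of the string.

```python
import re

def halve_indent(text: str) -> str:
    # ^( +)\1 : a run of spaces at the start of a line, followed by the
    # same run again; keeping only the captured half halves the indent.
    return re.sub(r"^( +)\1", r"\1", text, flags=re.MULTILINE)

code = "def f():\n    if x:\n        return 1\n"
print(halve_indent(code))
```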
You can try this:
^(\s{2})|((?<=\n(\s)+))(\s{2})
Breakdown:
^(\s{2}) = Searches for two spaces at the beginning of the line
((?<=\n(\s)+))(\s{2}) = Searches for two spaces
but only if a new line followed by any number of spaces is in front of it.
(This prevents two spaces within the line being replaced)
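I'm not completely familiar with Perl, but I would try this to see if it works (note that not every flavor accepts a variable-length lookbehind such as (?<=\n(\s)+), and that the replacement has to be three literal spaces, because \s has no special meaning in a replacement string):
s/^(\s{2})|((?<=\n(\s)+))(\s{2})/   /g
As @Jan pointed out, \s also matches whitespace characters other than the space. If that is an issue, try this:
s/^( {2})|((?<=\n( )+))( {2})/   /g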
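Where the regex flavor lets the replacement be a function, the "how many times did the + match" problem from the question goes away, because the callback can measure the matched indent directly. A minimal Python sketch, with the function name reindent and its parameters chosen here for illustration:

```python
import re

def reindent(text: str, old: int = 2, new: int = 3) -> str:
    """Replace each leading indent level of `old` spaces with `new` spaces."""
    # Matches one or more whole indent levels at the start of a line.
    pattern = re.compile(rf"^(?: {{{old}}})+", flags=re.MULTILINE)

    def widen(match: re.Match) -> str:
        levels = len(match.group(0)) // old
        return " " * (levels * new)

    return pattern.sub(widen, text)

source = "int some_function() {\n  int x = 3;  // Some silly comment\n    y();\n}"
print(reindent(source))
```

Only spaces at the start of a line are touched, so the two spaces before the comment stay as they are.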

Adding a space character to my regex

I would like some help in getting this regex to accept the space character.
The following regex works ^a|a$|a but this one doesn't ^tips to|tips to$|tips to.
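A space in a regex is matched literally (just type the space character and it should work). Alternatively, you can use the \s special character, which matches any whitespace. For example, in Perl:
my $test = "Hello world";   # contains one space
if ($test =~ m/ /)          # a literal space in the pattern matches a space
{
    print("Has space\n");
}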
Also if you can specify more what you want to use the regex for, we might be able to help better.
Try escaping just the last space (the regex engine will then "see" that "tips to" is one block, at least for the last alternative):
^tips to|tips to$|tips\ to
Or, to be on the safe side, group what you're searching for:
^(tips to)|(tips to)$|(tips to)
[EDIT 1]
so here's the solution the OP is using:
^"tips to"|"tips to"$|"tips to"
The regular expression that matches 1 space character is 1 space character.
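A small Python sketch of the point about literal spaces. The free-spacing case is only one guess at why a space might appear not to work in some tool; nothing in the question confirms that this was the cause. The sample text is made up.

```python
import re

text = "some tips to share"

# In a normal pattern a space is just a space.
plain = re.compile(r"^tips to|tips to$|tips to")
print(bool(plain.search(text)))

# In free-spacing (verbose) mode unescaped spaces are ignored,
# so this pattern is really "tipsto" and does not match.
verbose = re.compile(r"tips to", re.VERBOSE)
print(bool(verbose.search(text)))

# Escaping the space makes it significant again.
escaped = re.compile(r"tips\ to", re.VERBOSE)
print(bool(escaped.search(text)))
```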

Perl: substitute any non-alphanumeric character with "_"

In perl I want to substitute any character not [A-Z]i or [0-9] and replace it with "_" but only if this non alphanumerical character occurs between two alphanumerical characters. I do not want to touch non-alphanumericals at the beginning or end of the string.
I know enough regex to replace them, just not to only replace ones in the middle of the string.
s/(\p{Alnum})\P{Alnum}(\p{Alnum})/${1}_${2}/g;
Of course, that would hurt your chances with "#A#B%C", because the B is consumed by the first match and is no longer available to the second, so you might use look-arounds:
s/(?<=\p{Alnum})\P{Alnum}(?=\p{Alnum})/_/g;
That way you isolate it to just the non "alnum" character.
Or you could use the "keep flag", as well and get the same thing done.
s/\p{Alnum}\K\P{Alnum}(?=\p{Alnum})/_/g;
EDIT based on input:
To not eat a newline, you could do the following:
s/\p{Alnum}\K[^\p{Alnum}\n](?=\p{Alnum})/_/g;
Try this:
my $str = 'a-2=c+a()_';
$str =~ s/(?<=[A-Z0-9])[^A-Z0-9](?=[A-Z0-9])/_/gi;
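The difference between consuming the neighbours and only looking at them can be checked with a short Python sketch. It assumes ASCII letters and digits only, unlike \p{Alnum}, and the function names are made up for illustration.

```python
import re

ALNUM = "A-Za-z0-9"  # ASCII-only stand-in for \p{Alnum}

def consume_neighbours(text: str) -> str:
    # Captures both neighbours, so adjacent matches cannot share a character.
    return re.sub(rf"([{ALNUM}])[^{ALNUM}]([{ALNUM}])", r"\1_\2", text)

def lookaround(text: str) -> str:
    # Neighbours are only asserted, so each one can serve two matches.
    return re.sub(rf"(?<=[{ALNUM}])[^{ALNUM}](?=[{ALNUM}])", "_", text)

print(consume_neighbours("#A#B%C"))
print(lookaround("#A#B%C"))
print(lookaround("a-2=c+a()_"))
```

Non-alphanumeric characters at the start or end of the string are left alone in both versions, since they lack an alphanumeric neighbour on one side.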