I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lockahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with PRE, my feeling is, that there must be a very simple and elegant solution for this...
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that i s a n cannot be distinguished from a normal 4 letter word, that requires human correction, or some language module.
Explanation:
(?<!\S) negative look-behind assertion checks that the character behind is not a non-whitespace.
(\S) next must follow a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
(?=\S ) next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
Then put back the character we captured with $1
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to do that change if you feel it is appropriate.
Related
The regex s/\A\s*\n// removes every all-whitespace line from the beginning of a string.
It leaves everything else alone, including any whitespace that might begin the first visible line.
By "visible line," I mean a line that satisfies /\S/.
The code below demonstrates this.
But how does it work?
\A anchors the start of the string
\s* greedily grabs all whitespace. But without the (?s) modifier, it should stop at the end of the first line, should it not?
See
https://perldoc.perl.org/perlre.
Suppose that without the (?s) modifier it nevertheless "treats the string as a single line".
Then I would expect the greedy \s* to grab every whitespace character it sees,
including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.
Nevertheless, the code does exactly what I want. Since I can't explain it, it's like a kludge, something that happens to work, discovered through trial and error. What is the reason it works?
#!/usr/bin/env perl
use strict; use warnings;
print $^V; print "\n";
my #strs=(
join('',"\n", "\t", ' ', "\n", "\t", ' dogs',),
join('',
"\n",
"\n\t\t\x20",
"\n\t\t\x20",
'......so what?',
"\n\t\t\x20",
),
);
my $count=0;
for my $onestring(#strs)
{
$count++;
print "\n$count ------------------------------------------\n";
print "|$onestring|\n";
(my $try1=$onestring)=~s/\A\s*\n//;
print "|$try1|\n";
}
But how does it work?
...
I would expect the greedy \s* to grab every whitespace character it sees, including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.
Correct -- the \s* at first grabs everything up to the d (in dogs) and with that the match would fail ... so it backs up, a character at a time, shortening that greedy grab so to give a chance to the following pattern, here \n, to match.
And that works! So \s* matches up to (the last!) \n, that one is matched by the following \n in the pattern, and all is well. That's removed and we stay with "\tdogs" which is printed.
This is called backtracking. See about it also in perlretut. Backtracking can be suppressed, most notably by possesive forms (like \w++ etc), or rather by extended construct (?>...).
But without the (?s) modifier, it should stop at the end of the first line, should it not?
Here you may be confusing \s with ., which indeed does not match \n (without /s)
There are two questions here.
The first is about the interaction of \s and (lack of) (?s). Quite simply, there is no interaction.
\s matches whitespaces characters, which includes Line Feed (LF). It's not affected by (?s) whatsoever.
(?s) exclusively affects ..
(?-s) causes . to match all characters except LF. [Default]
(?s) causes . to match all characters.
If one wanted to match whitespace on the current line, one could use \h instead of \s. It only matches horizontal whitespace, thus excluding CR and LF (among others).
Alternatively, (?[ \s - \n ])[1], [^\S\n][2] and \s(?<!\n)[3] all match whitespace characters other than LF.
The second is about a misconception of what greediness means.
Greediness or lack thereof doesn't affect if a pattern can match, just what it matches. For example, for a given input, /a+/ and /a+?/ will both match, or neither will match. It's impossible for one to match and not the other.
"aaaa" =~ /a+/ # Matches 4 characters at position 0.
"aaaa" =~ /a+?/ # Matches 1 character at position 0.
"bbbb" =~ /a+/ # Doesn't match.
"bbbb" =~ /a+?/ # Doesn't match.
When something is greedy, it means it will match the most possible at the current position that allows the entire pattern to match. Take the following for example:
"ccccd" =~ /.*d/
This pattern can match by having .* match only cccc instead of ccccd, and thus does so. This is achieved through backtracking. .* initially matches ccccd, then it discovers that d doesn't match, so .* tries matching only cccc. This allows the d and thus the entire pattern to match.
You'll find backtracking used outside of greediness too. "efg" =~ /^(e|.f)g/ matches because it tries the second alternative when it's unable to match g when using the first alternative.
In the same way as .* avoids matching the d in the earlier example, the \s* avoids matching the LF and tab before dog in your example.
Requires use experimental qw( regex_sets ); before 5.36, but it was safe to use since 5.18 as it was accepted without change since its introduction as an experimental feature..
Less clear because it uses double negatives.[^\S\n]= A char that's ( not( not(\s) or LF ) )= A char that's ( not(not(\s)) and not(LF) )= A char that's ( \s and not LF )
Less efficient, and far from as pretty as the regex set.
I have found one method, but I don't understand the principle:
#remove lines starting with //
$file =~ s/(?<=\n)[ \t]*?\/\/.*?\n//sg;
How does (?<=\n)[ \t]*? work?
The critical piece is the lookbehind (?<=...). It is a zero-width assertion, what means that it does not consume its match -- it only asserts that the pattern given inside is indeed in the string, right before the pattern that follows it.
So (?<=\n)[ \t] matches either a space or a tab, [ \t], that has a newline before it. With the quantifier, [ \t]*, it matches a space-or-tab any number of times (possibly zero). Then we have the // (each escaped by \). Then it matches any character any number of times up to the first newline, .*?\n.
Here ? makes .* non-greedy so that it stops at the first match of the following pattern.
This can be done in other ways, too.
$file =~ s{ ^ \s* // .*? \n }{}gmx
The modifier m makes anchors ^ and $ (unused here) match the beginning and end of each line. I use {}{} as delimiters so that I don't have to escape /. The modifier x allows use of spaces (and comments and newlines) inside for readability.
You can also do it by split-ing the string by newline and passing lines through grep
my $new_file = join '\n', grep { not m|^\s*//.*| } split /\n/, $file;
The split returns a list of lines and this is input for grep, which passes those for which the code in the block evaluates to true. The list that it returns is then joined back, if you wish to again have a multiline string.
If you want lines remove join '\n' and assign to an array instead.
The regex in the grep block is now far simpler, but the whole thing may be an eye-full in comparison with the previous regex. However, this approach can turn hard jobs into easy ones: instead of going for a monster master regex, break the string and process the pieces easily.
I am attempting to replace the spaces in my string with an under-bar. With my limited coding experience, I have come up with this -
s/\b[ ]\D/_/g
This command works in finding all of the appropriate selections of my file however, it replaces the space and the proceeding character rather than only the space. How can I insure it only replaces the whitespaces and no additional characters?
Also, I would not like this to affect number characters (hence the \D).
The regex \b[ ]\D (which could also be written as \b \D, by the way) matches the space and the following non-digit character, so that's what's replaced with an underscore.
There are two (well, there are more, but these two are the straightforward ones) ways go go about fixing this in Perl:
With a capture group and back reference:
s/\b (\D)/_\1/g
Here the regex will still match the space and the non-digit character, but the non-digit character will be remembered as \1 and used as part of the replacement.
With a lookahead zero-length assertion:
s/\b (?=\D)/_/g
(?=\D) matches the empty string if (and only if) it is followed by something matching \D, so the non-digit character is no longer part of the match and is not replaced.
Addendum: By the way, I suspect you meant to use \b\D instead of just \D. \D matches spaces (because they are not digits), therefore
$ echo 'foo 123 bar baz qux' | perl -pe 's/\b (?=\D)/_/g'
foo 123_bar_ baz_qux
as opposed to
$ echo 'foo 123 bar baz qux' | perl -pe 's/\b (?=\b\D)/_/g'
foo 123_bar baz_qux
Try
s/\s/_/g
The \s is the character that will match all whitespace.
If you are worried about abutting spaces use \s+
the + means 1 or more whitespace characters.
Let's say you have some lines that look like this
1 int some_function() {
2 int x = 3; // Some silly comment
And so on. The indentation is done with spaces, and each indent is two spaces.
You want to change each indent to be three spaces. The simple regex
s/ {2}/ /g
Doesn't work for you, because that changes some non-indent spaces; in this case it changes the two spaces before // Some silly comment into three spaces, which is not desired. (This gets far worse if there are tables or comments aligned at the back end of the line.)
You can't simply use
/^( {2})+/
Because what would you replace it with? I don't know of an easy way to find out how many times a + was matched in a regex, so we have no idea how many altered indents to insert.
You could always go line-by-line and cut off the indents, measure them, build a new indent string, and tack it onto the line, but it would be oh so much simpler if there was a regex.
Is there a regular expression to replace indent levels as described above?
In some regex flavors, you can use a lookbehind:
s/(?<=^ *) / /g
In all other flavors, you can reverse the string, use a lookahead (which all flavors support) and reverse again:
s/ (?= *$)/ /g
Here's another one, instead utilizing \G which has NET, PCRE (C, PHP, R…), Java, Perl and Ruby support:
s/(^|\G) {2}/ /g
\G [...] can match at one of two positions:
✽ The beginning of the string,
✽ The position that immediately follows the end of the previous match.
Source: http://www.rexegg.com/regex-anchors.html#G
We utilize its ability to match at the position that immediately follows the end of the previous match, which in this case will be at the start of a line, followed by 2 whitespaces (OR a previous match following the aforementioned rule).
See example: https://regex101.com/r/qY6dS0/1
I needed to halve the amount of spaces on indentation. That is, if indentation was 4 spaces, I needed to change it to 2 spaces.
I couldn't come up with a regex. But, thankfully, someone else did:
//search for
^( +)\1
//replace with (or \1, in some programs, like geany)
$1
From source: "^( +)\1 means "any nonzero-length sequence of spaces at the start of the line, followed by the same sequence of spaces. The \1 in the pattern, and the $1 in the replacement, are both back-references to the initial sequence of spaces. Result: indentation halved."
You can try this:
^(\s{2})|((?<=\n(\s)+))(\s{2})
Breakdown:
^(\s{2}) = Searches for two spaces at the beginning of the line
((?<=\n(\s)+))(\s{2}) = Searches for two spaces
but only if a new line followed by any number of spaces is in front of it.
(This prevents two spaces within the line being replaced)
I'm not completely familiar with perl, but I would try this to see if it work:
s/^(\s{2})|((?<=\n(\s)+))(\s{2})/\s\s\s/g
As #Jan pointed out, there can be other non-space whitespace characters. If that is an issue, try this:
s/^( {2})|((?<=\n( )+))( {2})/ /g
I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!
This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (That only works in Perl, AFAIK. In Ruby/Oniguruma, \h matches a hex digit; in every other flavor I know of, it's a syntax error.)
You can try using: -
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # Negates non-whitespace char, \n and \r
print $line
OUTPUT: -
55.25040882,3
[^\S\n\r]|[.,]{2,} -> This means either [^\S\n\r] or [.,]{2,}
[.,]{2,} -> This means replace , or . if there is more than 2 in the same
line.
[^\S\n\r] -> Means negate all whitespace character, linefeed, and newline.