Delete any non-word character but spaces and a single quote inside the word - regex

I want to clean up some text: remove everything except \w and \s, but keep a single ' inside a word (e.g. keep it in words like don't).
I could do
perl -plE "s/[^\w\s']//g" <<< "'a:b/c d????ef' don't"
which keeps the ', but also keeps it at the beginning or end of a string, e.g. it prints
'abc def' don't
I'm unable to implement the "keep this" part, (?<=\w)'(?=\w), i.e. remove the ' unless it is between two word characters.
The wanted result:
abc def don't
How to do this?

You could do this:
s/[^\w\s']|(?<!\w)'|'(?!\w)//g
Delete everything that is either
a character that is not (a word character or a space or '), or
a ' that is not preceded by a word character, or
a ' that is not followed by a word character
The first clause will match (and remove) all characters that we obviously don't want to keep.
The second and third clause will remove all ' characters unless they're surrounded by word characters on both sides.

You can also use a global match instead of a replacement; that way you only have to describe what you want to keep, and the pattern becomes simpler:
perl -ne"print /[\w\s]|\b'\b/g" <<< "'a:b/c d????ef' don't"

Related

Eliminate whitespace around single letters

I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with PRE, my feeling is that there must be a very simple and elegant solution for this...
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that i s a n cannot be distinguished from a normal 4-letter word; fixing that requires human correction or some language module.
Explanation:
(?<!\S) negative look-behind assertion checks that the character behind is not a non-whitespace.
(\S) next must follow a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
(?=\S ) next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
Then put back the character we captured with $1
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to do that change if you feel it is appropriate.

Perl Split using "*"

If I use split like this:
my @split = split(/\s*/, $line);
print "$split[1]\n";
with input:
cat dog
I get:
a
However if I use \s+ in split, I get:
dog
I'm curious as to why they don't produce the same result? Also, what is the proper way to split a string by character?
Thanks for your help.
\s* effectively means zero or more whitespace characters. Between c and a in cat are zero spaces, yielding the result you're seeing.
To the regex engine, your string looks as follows:
c
zero spaces
a
zero spaces
t
multiple spaces
d
zero spaces
o
zero spaces
g
Following this logic, if you use \s+ as a separator, it will only match the multiple spaces between cat and dog.
* matches 0 or more times. Which means it can match the empty string between characters. + matches 1 or more times, which means it must match at least one character.
This is described in the documentation for split:
If PATTERN matches the empty string, the EXPR is split at the match position (between characters).
Additionally, when you split on whitespace, most of the time you really want to use a literal space:
.. split ' ', $line;
As described here:
As another special case, "split" emulates the default behavior of the
command line tool awk when the PATTERN is either omitted or a literal
string composed of a single space character (such as ' ' or "\x20",
but not e.g. "/ /"). In this case, any leading whitespace in EXPR is
removed before splitting occurs, and the PATTERN is instead treated as
if it were "/\s+/"; in particular, this means that any contiguous
whitespace (not just a single space character) is used as a separator.
However, this special treatment can be avoided by specifying the
pattern "/ /" instead of the string " ", thereby allowing only a
single space character to be a separator.
If you want to split a string into a list of individual characters then you should use an empty regex pattern for split, like this
my $line = 'cat';
my @split = split //, $line;
print "$_\n" for @split;
output
c
a
t
Some people prefer unpack, like this
my @split = unpack '(A1)*', $line;
which gives the same result here.

Removing repeated characters, including spaces, in one line

I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!
This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (AFAIK that works in Perl 5.10+ and PCRE, but it is not portable: in Ruby/Oniguruma, \h matches a hex digit, and other flavors may reject it.)
You can try using: -
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # Negates non-whitespace char, \n and \r
print $line
OUTPUT: -
55.25040882,3
[^\S\n\r]|[.,]{2,} -> This means either [^\S\n\r] or [.,]{2,}
[.,]{2,} -> This matches a run of two or more consecutive , or . characters, which is then removed.
[^\S\n\r] -> This matches any whitespace character except linefeed (\n) and carriage return (\r).

Specify a "-" in a sed pattern

I'm trying to find a '-' character in a line, but this character is used for specifying a range. Can I get an example of a sed pattern that will contain the '-' character?
Also, it would be quicker if I could use a pattern that includes all characters except a space and a tab.
'-' specifies a range only between square brackets.
For example, this:
sed -n '/-/p'
prints all lines containing a '-' character. If you want a '-' to represent itself between square brackets, put it immediately after the [ or before the ]. This:
sed -n '/[-x]/p'
prints all lines containing either a '-' or a 'x'.
This pattern:
[^ <tab>]
matches all characters other than a space and a tab (note that you need a literal tab character, not "<tab>").
If you want to find a dash, specify it outside a character class:
/-/
If you want to include it in a character class, make it the first or last character:
/[-a-z]/
/[a-z-]/
If you want to find anything except a blank or tab (or newline), then:
/[^ \t]/
where, I hasten to add, the '\t' is a literal tab character and not backslash-t.
To find "-" in a file with awk:
awk '/-/' file
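Putting the sed patterns above together, here is a small sketch. The function names are hypothetical, and the tab variable is just one way to get a literal tab into the bracket expression without typing it.

```shell
# Hypothetical helpers; each filters stdin and prints the matching lines.
tab=$(printf '\t')                            # a literal tab character

has_dash()      { sed -n '/-/p'; }            # dash outside brackets is an ordinary character
has_dash_or_x() { sed -n '/[-x]/p'; }         # dash first inside brackets is literal too
has_nonblank()  { sed -n "/[^ $tab]/p"; }     # any character other than space or tab

printf 'a-b\nxyz\nq\n' | has_dash             # prints: a-b
printf 'a-b\nxyz\nq\n' | has_dash_or_x        # prints: a-b and xyz
```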

Perl: substitute any non-alphanumeric character with "_"

In Perl I want to replace every character that is not a letter ([A-Z], case-insensitive) or a digit ([0-9]) with "_", but only if this non-alphanumeric character occurs between two alphanumeric characters. I do not want to touch non-alphanumerics at the beginning or end of the string.
I know enough regex to replace them, just not to only replace ones in the middle of the string.
s/(\p{Alnum})\P{Alnum}(\p{Alnum})/${1}_${2}/g;
Of course that would hurt your chances with "#A#B%C" (the match consumes the B, so the % no longer has an alphanumeric in front of it), so you might use look-arounds:
s/(?<=\p{Alnum})\P{Alnum}(?=\p{Alnum})/_/g;
That way you isolate it to just the non "alnum" character.
Or you could use the "keep flag", as well and get the same thing done.
s/\p{Alnum}\K\P{Alnum}(?=\p{Alnum})/_/g;
EDIT based on input:
To not eat a newline, you could do the following:
s/\p{Alnum}\K[^\p{Alnum}\n](?=\p{Alnum})/_/g;
Try this:
my $str = 'a-2=c+a()_';
$str =~ s/(?<=[A-Z0-9])[^A-Z0-9](?=[A-Z0-9])/_/gi;